Discussion:
[fpc-devel] new string - question on usage
Martin
2011-10-10 17:12:55 UTC
Permalink
With fpc trunk strings are now codepage aware.

I currently face the issue that lots of old code just uses "var foo:
string", or sometimes an explicit "ansistring". No idea what encoding that
stores, but it does not seem to be utf8.

If I pass such a string (which contains utf8, but seems not to be marked
as such) to a function declared as
function Bar(val: Utf8String): Utf8String;
then I can see that it gets converted (which of course renders the data
completely broken).

Now until I can convert all the strings to Utf8String, and until I can
test all of that... I need some workaround for the very few places
where such a call happens.

Is there a way to pass the string as argument but to suppress conversion
(but keep the function declaration as it is with utf8string)?
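
To make the situation concrete, here is a minimal, self-contained sketch
of it (Bar is the declaration quoted above; the program name and the
literal are made up for illustration):

program MisTagged;
{$mode objfpc}{$H+}

function Bar(val: Utf8String): Utf8String;
begin
  Result := val;
end;

var
  a: string;      // static code page is the system default, but we store UTF-8 bytes in it
begin
  a := #$C3#$A4;  // the two UTF-8 bytes of 'ä'
  // The compiler inserts a "system code page -> UTF-8" conversion for this call,
  // re-encoding bytes that are already UTF-8 - which is exactly the corruption described.
  Bar(a);
end.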
Luiz Americo Pereira Camara
2011-10-10 20:11:30 UTC
Permalink
Post by Martin
With fpc trunk strings are now codepage aware.
string". or sometimes explicit "ansistring". No idea what encoding
that stores, put it does not seem to be utf8.
If I pass such a string (which contains utf8, but seems not to be
marked as such) to a function declared as
function Bar(val: Utf8String): Utf8String;
then I can see that it gets converted (which of course renders the
data completely broken).
Now until I can convert all the strings to Utf8String, and until I can
test all of that.... I need some work around for the very few places
where such a call happens.
Is there a way to pass the string as argument but to suppress
conversion (but keep the function declaration as it is with utf8string)?
AFAIK the fpc feature is incomplete and buggy, so it will not work anyway,
and making workarounds for it now is a bad decision. Changes in Lazarus
must be made after the dust settles on the fpc side.

When that time comes, this is what I see:

1- Most of the LCL must be code-page agnostic, so it should not use
UTF8String/AnsiString directly (keep String).
2- There should be (I don't know if there currently is) a compiler switch to
change the default code page to UTF8 or whatever, so all variables of
type String will map to UTF8String.
3- The UTF8String/AnsiString types should be reserved for where they are
strictly necessary, like libraries that require UTF8 or RTL interfacing.

The bug that you cited won't happen if you use the feature of point 2.
But this won't work properly anyway, since the unicode RTL/classes are not
done yet.

Luiz
Jonas Maebe
2011-10-10 20:56:37 UTC
Permalink
1- Most of LCL must be code page agnostic, so not use UTF8String/AnsiString directly (keep String)
There is no difference between ansistring and string in {$mode delphi} and {$mode objfpc}. In a future delphiunicode mode or something like that string will be unicodestring, but that's not "code-page agnostic" either. The only somewhat code page agnostic string type is RawByteString.
2- It should have (dont know if currently has) a compiler switch to change the default code page to UTF8 or whatever, so all variables with type String will map to UTF8String.
I doubt that such a feature will be added. If you want that, declare your own string type with whatever default code page you want to use and use that type everywhere.
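
For illustration, a sketch of that suggestion (the type name is made up;
the AnsiString(<codepage>) syntax is the one shown later in this thread):

type
  // one project-wide string type bound to code page 65001 (UTF-8)
  LazString = type AnsiString(65001);

Every unit that consistently uses such a type instead of plain string gets
UTF-8 semantics, at the price of a conversion whenever it meets a string
tagged with a different code page.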


Jonas
Martin
2011-10-10 21:11:01 UTC
Permalink
Post by Jonas Maebe
2- It should have (dont know if currently has) a compiler switch to change the default code page to UTF8 or whatever, so all variables with type String will map to UTF8String.
I doubt that such a feature will be added. If you want that, declare your own string type with whatever default code page you want to use and use that type everywhere.
But that will always just push the issue to another location.
Somewhere the change from string to utf8string must be made.

In my case that is in synedit. Even if I changed every string in
synedit, it would still be used from the IDE and many user apps, with
just "string". So then the text would be corrupted at that point.
The only way to do that is if every single fpc/lazarus user changes at
the same time.

And what happens if an app reads data from some external source
(a serial port) and then wants to declare what encoding it is?
Jonas Maebe
2011-10-10 21:36:16 UTC
Permalink
Post by Martin
Post by Jonas Maebe
2- It should have (dont know if currently has) a compiler switch to change the default code page to UTF8 or whatever, so all variables with type String will map to UTF8String.
I doubt that such a feature will be added. If you want that, declare your own string type with whatever default code page you want to use and use that type everywhere.
But that will always just push the issue to another location.
Changing the default code page of the "string" type in a particular unit via a compiler switch would not change that (a program usually also uses units that have been compiled earlier on, and those may then have used a different default code page).
Post by Martin
Somewhere the change from string to utf8string must be made.
As long as that string contains correct code page information, that is no problem. But as mentioned in the message by Luiz: "AFAIK the fpc feature is incomplete and buggy so it will not work anyway and making workarounds to it now is a bad decision." He is correct. Especially regarding constant strings there are still several bugs. Some of them may or may not be fixed by the patches from http://bugs.freepascal.org/view.php?id=20449
Post by Martin
And what happens if an app did read data from some external source (serial port) and then wants to declare what encoding it is?
http://docwiki.embarcadero.com/VCL/en/System.SetCodePage
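
For the serial-port case, a hedged sketch of how that call could look,
assuming the FPC signature quoted later in this thread and a CP_UTF8
constant for code page 65001:

var
  Buf: RawByteString;
begin
  // Buf has been filled byte-for-byte from the external source and is
  // known to contain UTF-8
  SetCodePage(Buf, CP_UTF8, False);  // only declare the encoding, leave the bytes alone
  // SetCodePage(Buf, CP_UTF8);      // ...or actually convert from Buf's current code page
end;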


Jonas
Martin
2011-10-10 22:10:26 UTC
Permalink
Post by Jonas Maebe
Post by Martin
Post by Jonas Maebe
2- It should have (dont know if currently has) a compiler switch to change the default code page to UTF8 or whatever, so all variables with type String will map to UTF8String.
I doubt that such a feature will be added. If you want that, declare your own string type with whatever default code page you want to use and use that type everywhere.
But that will always just push the issue to another location.
Changing the default code page of the "string" type in a particular unit via a compiler switch would not change that (a program usually also uses units that have been compiled earlier on, and those may then have used a different default code page).
I wasn't asking for a change of the default,

just for how to do this:

procedure foo(x: utf8string); begin end;

var a: string; //ansistring, but contains already utf8

foo(a); // do not convert
Post by Jonas Maebe
Post by Martin
And what happens if an app did read data from some external source (serial port) and then wants to declare what encoding it is?
http://docwiki.embarcadero.com/VCL/en/System.SetCodePage
I hadn't seen that.

That may help. Though not the best solution...

I can call it before calling the "foo" proc. But I must revert it
afterwards, or at some later time the string will be translated when it
is used as a normal string again (yet is expected to still be utf8).

Yes, I know, what I want to do is not what this was designed for.
Ultimately a huge update to the entire source will be needed... but for
now I need a temporary solution until then.
Paul Ishenin
2011-10-10 22:19:15 UTC
Permalink
Post by Martin
I wasn't askin for changing the default.
just for how to do
procedure foo(x: utf8string); begin end;
var a: string; //ansistring, but contains already utf8
foo(a); // do not convert
use
foo(x: rawbytestring)
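
i.e. something like this (a sketch; a RawByteString parameter accepts any
single-byte string without a conversion being inserted at the call site):

procedure foo(x: RawByteString);
begin
  // x keeps the caller's bytes and code page tag unchanged
end;

var
  a: string;  // ansistring that already holds utf8, as in your example
begin
  foo(a);     // passed through without re-encoding
end;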
Post by Martin
Post by Jonas Maebe
Post by Martin
And what happens if an app did read data from some external source
(serial port) and then wants to declare what encoding it is?
http://docwiki.embarcadero.com/VCL/en/System.SetCodePage
I hadn't seen that.
That may help. Though not the best solution...
I can call it before calling the "foo" proc. But I must revert it
afterwards, or at sometime later, the string will be translated, when
it will be used in a normal string again (yet expected to keep being
utf8..
Yes, I know, what i want to do, is not what it was designed for.
ultimately a huge update to the entire source will be needed... but
now I need a temporary solution until then
Don't use the utf8string type until all Lazarus code uses it.

Best regards,
Paul Ishenin.
Martin
2011-10-10 22:23:05 UTC
Permalink
Post by Paul Ishenin
Post by Martin
I wasn't askin for changing the default.
just for how to do
procedure foo(x: utf8string); begin end;
var a: string; //ansistring, but contains already utf8
foo(a); // do not convert
use
foo(x: rawbytestring)
Not good.

Utf8ToLower is (and should be) declared as expecting a Utf8String.

If it is indeed called with an ansistring, it should be converted.
Post by Paul Ishenin
Post by Martin
Yes, I know, what i want to do, is not what it was designed for.
ultimately a huge update to the entire source will be needed... but
now I need a temporary solution until then
Don't use utf8string type until all Lazarus code use it.
I'd like to.

Currently, LazUtils does...
Paul Ishenin
2011-10-10 22:28:23 UTC
Permalink
Post by Martin
I'd like to.
at curren LazUtils does....
For this case we have a Russian saying: "do not run ahead of the
locomotive".

Best regards,
Paul Ishenin.
Michael Schnell
2011-10-11 07:49:01 UTC
Permalink
Post by Martin
Utf8ToLower is, (and should) be declared expecting a Utf8String.
Why should a function Utf8ToLower be used (or even be defined for
normal use) ?

With dynamically encoded Strings "ToLower" should work for any encoding.

-Michael
Hans-Peter Diettrich
2011-10-11 09:20:46 UTC
Permalink
Post by Michael Schnell
Post by Martin
Utf8ToLower is, (and should) be declared expecting a Utf8String.
Why should a function Utf8ToLower be used (or even be defined for
normal use) ?
Because it expects a UTF8 argument and provides a UTF8 result, so
that no further conversions are required when it is used with strings of
exactly that encoding.
Post by Michael Schnell
With dynamically encoded Strings "ToLower" should work for any encoding.
You mean something like this?
function ToLower(s: RawByteString): RawByteString;
[dunno whether RawByteString is an allowed Result type at all]

Then this function has to determine the encoding internally, convert
strings of unhandled encodings, and then apply the lower-casing
implemented for the given or converted encoding. When the result is
used, another check of the encoding, and possibly a conversion, has to
be inserted by the compiler.
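
Roughly like this, as a sketch (it assumes a StringCodePage helper to read
the dynamic encoding, the Utf8ToLower routine discussed above, and
SysUtils.AnsiLowerCase as the fallback; RawByteString results do compile
in Delphi, though with the gotchas mentioned elsewhere in this thread):

function ToLower(const s: RawByteString): RawByteString;
begin
  case StringCodePage(s) of
    65001:                        // already UTF-8: no conversion on the way in or out
      Result := Utf8ToLower(s);
  else                            // any other encoding: converted to the system
    Result := AnsiLowerCase(s);   // code page first, lower-cased there
  end;
end;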


IMO you should understand that each new "string" type is bound to one
specific encoding; a dynamic re-encoding is not possible. Even Delphi
does not work with "polymorphic" strings; its generic "string" type is
UTF-16 encoded.

Use RawByteString instead if you want strings with no fixed encoding.
But RawByteString implies an overhead, since the compiler must insert
checks and conversions whenever two strings of *possibly* different
encodings are involved in any operation, maybe even an assignment...

DoDi
Michael Schnell
2011-10-11 08:52:29 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Michael Schnell
Why should a function Utf8ToLower be used (or even be defined for
normal use) ?
Because it expects and UTF8 argument, and provides an UTF8 result, so
that no further conversions are required when used with strings of
exactly that encoding.
I don't understand your argument. If ToLower gets a new string that is
UTF8 encoded, the result should be a new string that is UTF8 encoded. So
why bother? Checking the encoding word and branching to the appropriate
encoding-aware functionality is a very fast operation.
Post by Hans-Peter Diettrich
Post by Michael Schnell
With dynamically encoded Strings "ToLower" should work for any encoding.
You mean something like this?
function ToLower(s: RawByteString): RawByteString;
[dunno whether RawByteString is an allowed Result type at all]
In fact I still don't understand the difference between a type called
"RawByteString" and a basic new String that happens to be set to the
encoding "RawByte".

IMO, calling ToLower with a string that is set to the encoding "RawByte"
does not make sense and should generate an exception. The user code
should set a decent "readable character" encoding before doing something
like ToLower (or "+" or comparing with a string that is in a "readable
character" encoding), if the string is obtained in a way that did not
set such a coding signature.

-Michael
Paul Ishenin
2011-10-11 09:30:26 UTC
Permalink
Post by Michael Schnell
In fact I still don't understand the difference between a type called
"RawByteString"and a basic new String that happens to be set to the
encoding "RawByte".
Encoding RawByte as well as encoding 0 (CP_ACP) are both treated as
DefaultSystemCodePage at places where the particular encoding must be known.

Best regards,
Paul Ishenin
Michael Schnell
2011-10-11 09:48:41 UTC
Permalink
Post by Paul Ishenin
Post by Michael Schnell
In fact I still don't understand the difference between a type called
"RawByteString"and a basic new String that happens to be set to the
encoding "RawByte".
Encoding RawByte as well as encoding 0 (CP_ACP) are both treated as
DefaultSystemCodePage at pleaces where the paticular encoding must be known.
(a) Sorry, but this does not answer the question I tried to ask
(Difference between a possible type called RawByteString and a basic
"new string" variable that happens to be set to the Encoding ID "RawByte").

(b) I had the impression that the Encoding "RawByte" prevents any
auto-Conversion. If this is not the case, IMHO, such an encoding ID
should exist, as it can make sense to have a string variable that either
is not intended to hold readable text (but just bytes) or the coding of
which is still unknown and that might get a "readable text" encoding ID
later.

-Michael
Hans-Peter Diettrich
2011-10-11 20:11:57 UTC
Permalink
Post by Michael Schnell
Post by Paul Ishenin
Post by Michael Schnell
In fact I still don't understand the difference between a type called
"RawByteString"and a basic new String that happens to be set to the
encoding "RawByte".
Encoding RawByte as well as encoding 0 (CP_ACP) are both treated as
DefaultSystemCodePage at pleaces where the paticular encoding must be known.
Thanks Paul, now I understand the default AnsiString codepage/encoding
better. I already wondered why an AnsiString has codepage 1252, which
would be quite useless on a Russian, Greek or Japanese system.
Post by Michael Schnell
(a) Sorry, but this does not answer the question I tried to ask
(Difference between a possible type called RawByteString and a basic
"new string" variable that happens to be set to the Encoding ID "RawByte").
When I have a variable of type AnsiString, and assign a string to it,
then its encoding is reported as 1252 (my system codepage). On Paul's
machine it will have a different encoding, I assume?

DoDi
Paul Ishenin
2011-10-11 22:31:22 UTC
Permalink
Post by Hans-Peter Diettrich
When I have a variable of type AnsiString, and assign an string to it,
then its encoding is reported as 1252 (my system codepage). On Paul's
machine it will have a different encoding, I assume?
Yes, 1251 here.

Best regards,
Paul Ishenin.
Michael Schnell
2011-10-12 08:03:13 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Michael Schnell
(a) Sorry, but this does not answer the question I tried to ask
(Difference between a possible type called RawByteString and a basic
"new string" variable that happens to be set to the Encoding ID "RawByte").
When I have a variable of type AnsiString, and assign an string to it,
then its encoding is reported as 1252 (my system codepage). On Paul's
machine it will have a different encoding, I assume?
Via personal consulting ( :) ) I learned that the multiple new Pascal
string types are just a kind of syntax candy for an underlying common,
dynamically typed (and functioning in that way) string type. Seemingly,
when allocated, these strings get an appropriate encoding ID that is
effective even with a zero length.

Seemingly (other than I assumed) a " := " between new strings does not
preserve the encoding, but performs an encoding conversion to the
target's encoding ID.

So to prevent a conversion, you need to make sure that the target
has the same (or a compatible) encoding ID as the source - either by
using the appropriate string types (hoping that the encoding ID has not
been changed) or by using SetCodePage. I suppose there is also a
function intended to do a "pure", code-ID-preserving assignment.

I suppose a variable of the type "String" is pre-loaded with the
predefined "System" encoding ID.

-Michael
Paul Ishenin
2011-10-12 08:09:27 UTC
Permalink
Post by Michael Schnell
I suppose a variable of the type "String" is pre-loaded with the
predefined "System" encoding ID.
If you mean "AnsiString" then it is loaded with encoding 0 which means
default system codepage. It will get the real encoding number after the
first assignment.

Best regards,
Paul Ishenin.
Michael Schnell
2011-10-12 08:24:57 UTC
Permalink
Post by Paul Ishenin
Post by Michael Schnell
I suppose a variable of the type "String" is pre-loaded with the
predefined "System" encoding ID.
If you mean "AnsiString" then it is loaded with encoding 0 which means
default system codepage. It will get the real encoding number after
the first assignment.
I understand that some day (when the official release comes out) "String"
will be the new String type, and thus ANSIString will be obsolete and just
an alias.

So a target encoding ID of "0" means that " := " will preserve the encoding
of the source and set the target appropriately, without doing a conversion.

Any other encoding ID will be left unmodified in the target, and a conversion
will be done if appropriate (supposedly no conversion when the target's
encoding ID is "Raw").

Seems clever, but not easy stuff to understand. :)

-Michael
Sven Barth
2011-10-12 08:35:36 UTC
Permalink
Post by Michael Schnell
Post by Paul Ishenin
Post by Michael Schnell
I suppose a variable of the type "String" is pre-loaded with the
predefined "System" encoding ID.
If you mean "AnsiString" then it is loaded with encoding 0 which means
default system codepage. It will get the real encoding number after
the first assignment.
I understand that some day (when the official release comes up) "String"
will be a new String type and thus ANSIString obsolete and just an alias.
No. In Delphi "String = UnicodeString", but AnsiString still exists as a
one-byte (or multi-byte) string type (the "new string type" or "code
page aware string type").

Regards,
Sven
Michael Schnell
2011-10-12 08:50:49 UTC
Permalink
Post by Sven Barth
No. In Delphi "String = UnicodeString", but AnsiString still exists as
a one-byte (or multi-byte) string type (the "new string type" or "code
page aware string type").
Sorry, but I don't understand.

According to the "TAnsiRec", such a "New String" not only has an
encoding ID, but also an "ElementSize" specification.

So an "ANSIString" that uses the TAnsiRec for it's implementation
obviously is capable of holding of 1, 2 and 4 byte encoded data.

-Michael
Hans-Peter Diettrich
2011-10-12 11:53:57 UTC
Permalink
Post by Michael Schnell
Post by Sven Barth
No. In Delphi "String = UnicodeString", but AnsiString still exists as
a one-byte (or multi-byte) string type (the "new string type" or "code
page aware string type").
Sorry, but I don't understand.
According to the "TAnsiRec", such a "New String" not only has an
encoding ID, but also an "ElementSize" specification.
So an "ANSIString" that uses the TAnsiRec for it's implementation
obviously is capable of holding of 1, 2 and 4 byte encoded data.
Ansi and Unicode strings share the same header structure. This would,
in theory, allow using only one (polymorphic) string type everywhere.
In fact this common structure only allows finding out about the
properties of a given string; it does not imply any assignment
compatibility.

All AnsiString types have an element size of 1, UnicodeString has 2 and
UCS4String has 4 bytes per element.

DoDi
Michael Schnell
2011-10-12 12:07:57 UTC
Permalink
Post by Hans-Peter Diettrich
All AnsiString types have an element size of 1, UnicodeString has 2
and UCS4String has 4 bytes per element.
Disregarding whether or not this makes sense: what technology enforces
this (e.g. Compiler Magic or RTL) ?

-Michael
Sven Barth
2011-10-12 12:09:00 UTC
Permalink
Post by Michael Schnell
Post by Hans-Peter Diettrich
All AnsiString types have an element size of 1, UnicodeString has 2
and UCS4String has 4 bytes per element.
Disregarding whether or not this makes sense: what technology enforces
this (e.g. Compiler Magic or RTL) ?
Basically both, as both rely on and use the fact that "AnsiString[i] =
AnsiChar" and "SizeOf(AnsiChar) = 1" and also "UnicodeString[i] =
UnicodeChar" and "SizeOf(UnicodeChar) = 2".

Regards,
Sven
Michael Schnell
2011-10-12 12:24:13 UTC
Permalink
Post by Sven Barth
Basically both, as both rely on and use the fact that "AnsiString[i] =
AnsiChar" and "SizeOf(AnsiChar) = 1" and also "UnicodeString[i] =
UnicodeChar" and "SizeOf(UnicodeChar) = 2".
Yep.

But what I wanted to ask is what happens if I disregard this, e.g. by using
the "wrong" String type as a parameter when calling a function, or when
using "SetCodePage" in a "creative" way. What about type casting?

-Michael
Sven Barth
2011-10-12 12:28:04 UTC
Permalink
Post by Michael Schnell
Post by Sven Barth
Basically both, as both rely on and use the fact that "AnsiString[i] =
AnsiChar" and "SizeOf(AnsiChar) = 1" and also "UnicodeString[i] =
UnicodeChar" and "SizeOf(UnicodeChar) = 2".
Yep.
But what I wanted to ask is what happens, if I disregard this, e.g using
the "wrong" String type as a parameter when calling a function or when
using "SetCodePage" in a "creative" way. What about Type casting ?
There will be a conversion, just as was done before the code page aware
strings as well. And then you might experience data loss if you e.g. cast
a UnicodeString with some fancy Unicode characters to an encoding which
doesn't support these characters (same as before).

Regards,
Sven
Michael Schnell
2011-10-12 12:40:26 UTC
Permalink
Post by Sven Barth
There will be a conversion.
Meaning:
- when it is a var parameter, an error message is issued.
- when it is a value parameter: conversion is called
- type cast will do a conversion
- assignment will do a conversion (at least if the target encoding ID
is not zero.)
- the conversion will not happen if the source encoding ID is "Raw"

Correct ?
Post by Sven Barth
Like was done before the code page aware string as well. And then you
might experience data loss if you e.g. cast a UnicodeString with some
fancy Unicode characters to an encoding which doesn't support these
characters (same as before).
Of course.

-Michael
Sven Barth
2011-10-12 12:39:40 UTC
Permalink
Post by Michael Schnell
Post by Sven Barth
There will be a conversion.
- when it is a var parameter, am error message is issued.
- when it is a value parameter: conversion is called
- type cast will do a conversion
- assignment will do a conversion (at least if the target encoding ID is
not zero.)
- the conversion will not happen if the source encoding ID is "Raw"
Correct ?
Without knowing the details (I haven't worked on/with code page aware
strings), I'd say yes (the only point I'm really unsure about is
"var"-arguments).

Regards,
Sven
Michael Schnell
2011-10-12 13:35:55 UTC
Permalink
I'd say yes (the only point I'm really unsure about is "var"-arguments).
I made this list according to what I expect with regard to different
numerical types (like integer and real) or two really different string
types (like short string and long string).

With that, an incorrectly used var parameter results in a compiler error
message, while the other cases result in a conversion (not really sure
about type casts).

In fact the var parameter case is the most interesting one regarding new strings.

While in the other cases the system can decide at runtime what to do
(with respect to the encoding ID(s)), with a var parameter the type
names might be used to generate an error message at compile time.

-Michael
Sven Barth
2011-10-12 14:23:05 UTC
Permalink
Post by Michael Schnell
I'd say yes (the only point I'm really unsure about is "var"-arguments).
I did this list according to what I expect regarding to different
numerical types (like integer and real) or two really different string
types (like short string and long string).
With that, an incorrectly used var parameter results in a compiler error
message, while the other cases result in conversion (not really sure
with type cast).
In fact the var parameter case is most interesting regarding new strings.
While in the other cases the system can decide at runtime what do do
(with respect to the encoding ID (s) ), with a var parameter the type
names might be used to generate an error message at compile time.
There was some discussion about how to handle var parameters, but I
don't remember the outcome anymore. AFAIK Delphi issues a compile error
(I don't know for sure though).

Regards,
Sven
Michael Schnell
2011-10-12 14:46:24 UTC
Permalink
Post by Sven Barth
There was some discussion about how to handle var parameters, but I
don't remember the outcome anymore. AFAIK Delphi issues a compile
error (I don't know for sure though).
Options are:

- compiler error
- compiler warning
- runtime exception
- conversion to and fro
- just passing the pointer (making it easy to create a string whose content and encoding tag disagree)


We will see :)
Sven Barth
2011-10-12 14:59:51 UTC
Permalink
Post by Michael Schnell
Post by Sven Barth
There was some discussion about how to handle var parameters, but I
don't remember the outcome anymore. AFAIK Delphi issues a compile
error (I don't know for sure though).
- compiler error
- compiler warning
- runtime exception
- conversion to and fro
- just passing the pointer (making creating an intersexual string easy)
We will see :)
So... now you got me to start my Delphi XE and I then noticed that I had
already written a small test project to test this ^^

Here is the code:

=== source begin ===

program strvartest;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  CyrillicString = type AnsiString(1251);
  LatinString = type AnsiString(1252);
  TestString = type AnsiString(65001);

procedure Test(var aStr: CyrillicString);
begin

end;

var
  s: LatinString;
begin
  s := 'Foo';
  Test(s);
end.

=== source end ===

Result:
[DCC Fehler] strvartest.dpr(22): E2033 Die Typen der tatsächlichen und
formalen Var-Parameter müssen übereinstimmen

For those who don't speak German (which shouldn't be you, Michael), a free
translation:
[DCC Error] strvartest.dpr(22): E2033 The types of the actual and formal
var-parameters have to match

So at least Delphi uses the compile-time error approach, so it is likely
that FPC will follow it.

Regards,
Sven
Michael Schnell
2011-10-12 15:20:45 UTC
Permalink
Thanks !
-Michael
Hans-Peter Diettrich
2011-10-13 00:05:10 UTC
Permalink
Post by Michael Schnell
Post by Sven Barth
There will be a conversion.
- when it is a var parameter, am error message is issued.
- when it is a value parameter: conversion is called
- type cast will do a conversion
Correct, so far.
Post by Michael Schnell
- assignment will do a conversion (at least if the target encoding ID
is not zero.)
... conversion unless the target type is RawByteString, or has the same
encoding.
Post by Michael Schnell
- the conversion will not happen if the source encoding ID is "Raw"
There is no "Raw" source ID. When source is the empty string, it
deserves no conversion. Otherwise source *has* an encoding, from the
last assignment to it.

DoDi

Sven Barth
2011-10-12 12:05:26 UTC
Permalink
Post by Michael Schnell
Post by Sven Barth
No. In Delphi "String = UnicodeString", but AnsiString still exists as
a one-byte (or multi-byte) string type (the "new string type" or "code
page aware string type").
Sorry, but I don't understand.
According to the "TAnsiRec", such a "New String" not only has an
encoding ID, but also an "ElementSize" specification.
So an "ANSIString" that uses the TAnsiRec for it's implementation
obviously is capable of holding of 1, 2 and 4 byte encoded data.
Trust me, I don't understand this decision of Embarcadero either...

Here you have the documentation on that topic:
http://docwiki.embarcadero.com/RADStudio/en/String_Types

Regards,
Sven

PS: Sorry for the private mail, Michael, but somehow when I answer a
mail from you using "Answer to List", I get you as "To" instead of the list.
Hans-Peter Diettrich
2011-10-12 11:45:38 UTC
Permalink
Post by Michael Schnell
I understand that some day (when the official release comes up) "String"
will be a new String type and thus ANSIString obsolete and just an alias.
No. "string" is an alias (generic type), all other string types are
distinct types.
Post by Michael Schnell
So target encoding ID "0" means that " := " will preserve the encoding
of the source and set the target appropriately without doing a conversion.
No. Codepage 0 stands for the system encoding, formerly the "native"
string encoding, i.e. 1252 on western Windows, or maybe 65001 (UTF-8) on
Linux (user selectable).

DoDi
Michael Schnell
2011-10-12 12:13:28 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Michael Schnell
So target encoding ID "0" means that " := " will preserve the
encoding of the source and set the target appropriately without doing
a conversion.
No. Codepage 0 stands for the system encoding, formerly "native"
string encoding. I.e. 1252 on western Windows, maybe 65001 (UTF-8) on
Linux (user selectable).
Not when a string with encoding ID 0 is used as a target.
Post by Hans-Peter Diettrich
If you mean "AnsiString" then it is loaded with encoding 0 which means
default system codepage. It will get the real encoding number after
the first assignment.
-Michael
Hans-Peter Diettrich
2011-10-12 10:13:31 UTC
Permalink
Post by Paul Ishenin
Post by Michael Schnell
I suppose a variable of the type "String" is pre-loaded with the
predefined "System" encoding ID.
If you mean "AnsiString" then it is loaded with encoding 0 which means
default system codepage. It will get the real encoding number after the
first assignment.
Delphi allows RawByteStrings with encoding 0. When such a string is
assigned to an AnsiString, the string encoding is still zero; both
variables seem to point to the same string data.

DoDi
Michael Schnell
2011-10-12 12:16:56 UTC
Permalink
Post by Hans-Peter Diettrich
Delphi allows for RawByteStrings with encoding 0. When assigned to an
AnsiString, the string encoding still is zero, both variables seem to
point to the same string data.
The pointer to the data array (managed by the "lazy copy" and reference
counting features) is independent of the encoding ID, which is part of
the string management record and not of the data array.

-Michael
Sven Barth
2011-10-12 12:17:36 UTC
Permalink
Post by Michael Schnell
Post by Hans-Peter Diettrich
Delphi allows for RawByteStrings with encoding 0. When assigned to an
AnsiString, the string encoding still is zero, both variables seem to
point to the same string data.
The pointing to the data array (managed by "lazy copy and reference
counting features) is independent from the encoding ID, that is part of
the string management record and not of the data array.
Wrong. Both reference counting and code page are part of the management
record. See here line 38:
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/rtl/inc/astrings.inc?revision=19444&view=markup
Or did I misunderstand you?

Regards,
Sven
Michael Schnell
2011-10-12 12:26:47 UTC
Permalink
Post by Michael Schnell
The pointing to the data array (managed by "lazy copy and reference
counting features) is independent from the encoding ID, that is part of
the string management record and not of the data array.
Wrong. Both reference counting and code page are part of the management
record.
That is exactly what I wanted to say.
-Michael
Sven Barth
2011-10-12 12:26:02 UTC
Permalink
Post by Michael Schnell
Post by Michael Schnell
The pointing to the data array (managed by "lazy copy and reference
counting features) is independent from the encoding ID, that is part of
the string management record and not of the data array.
Wrong. Both reference counting and code page are part of the management
record.
That is exactly what I wanted to say.
Then I indeed have misunderstood you. ;)

Regards,
Sven
Sven Barth
2011-10-12 12:16:06 UTC
Permalink
Post by Michael Schnell
Post by Hans-Peter Diettrich
Delphi allows for RawByteStrings with encoding 0. When assigned to an
AnsiString, the string encoding still is zero, both variables seem to
point to the same string data.
The pointing to the data array (managed by "lazy copy and reference
counting features) is independent from the encoding ID, that is part of
the string management record and not of the data array.
Wrong. Both reference counting and code page are part of the management
record. See here line 38:
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/rtl/inc/astrings.inc?revision=19444&view=markup

Regards,
Sven
Hans-Peter Diettrich
2011-10-12 10:09:22 UTC
Permalink
Post by Michael Schnell
Post by Hans-Peter Diettrich
When I have a variable of type AnsiString, and assign an string to it,
then its encoding is reported as 1252 (my system codepage). On Paul's
machine it will have a different encoding, I assume?
Via personal consulting ( :) ) I learned that the multiple new Pascal -
string - types just are a kind of syntax-candy for an underlying common
dynamically typed (and functioning in that way) string type. Seemingly
when allocated theses strings get an appropriate encoding ID that is
effective even with a zero length.
The encoding is associated with string types, and every variable knows
its type. I.e. we have a static encoding, associated with string types
and variables, and a dynamic encoding of string data. Similar to the
static and dynamic types of object references.
Post by Michael Schnell
Seemingly (other than I assumed) a " := " between new strings does not
preserve the encoding, but performs an encoding conversion to the
target's encoding ID.
Right. The encoding etc., as stored in the string header, is used while
processing strings, e.g. in expressions. In the assignment to a variable
the static encoding of that variable must be compared with the dynamic
encoding of the string data, and a conversion must be performed whenever
required.
Post by Michael Schnell
So for preventing a conversion, you need to make sure that the target
has the same (or a compatible) encoding ID as the source. (Either by
using the appropriate string types
Right, the new string types are *strict* types, declared as
type UTF8String = type AnsiString(65001);
Note the second "type", denoting a new type, not an alias as in the old
declaration of
type UTF8String = AnsiString;
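
A small sketch of the practical difference (type names made up to avoid
clashing with the RTL's UTF8String):

type
  OldUTF8String = AnsiString;               // alias: the very same type, no own code page
  NewUTF8String = type AnsiString(65001);   // distinct type bound to UTF-8

var
  a: AnsiString;
  o: OldUTF8String;
  n: NewUTF8String;
begin
  o := a;   // same type: a plain reference copy, nothing converted
  n := a;   // different declared code page: the compiler converts a's data to UTF-8 here
end;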
Post by Michael Schnell
(hoping the the encoding ID has not
been changed ) or by using SetCodePage.)
SetCodePage is applicable only to RawByteString, because this static
type is compatible with all dynamic types - like TObject is compatible
with all derived classes.
Post by Michael Schnell
I suppose there also is a
function that is done to do a "pure" code-ID preserving assignment.
Quite unlikely, as this defeats the idea of static typing. Low-level
hacking is possible, of course, but the effects are unpredictable. The
compiler assumes that the dynamic encoding matches the static one, and
generates code accordingly.
Post by Michael Schnell
I suppose a variable of the type "String" is pre-loaded with the
predefined "System" encoding ID.
No, empty strings still are Nil pointers.
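
Which is easy to check - a Nil pointer means there is no header and thus
no stored encoding at all (a sketch):

var
  s: AnsiString;
begin
  Assert(Pointer(s) = nil);   // never assigned: no data, no header, no encoding
  s := '';
  Assert(Pointer(s) = nil);   // the empty string is also represented as nil
end;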

DoDi
Michael Schnell
2011-10-12 12:19:22 UTC
Permalink
Post by Sven Barth
Post by Michael Schnell
Seemingly (other than I assumed) a " := " between new strings does
not preserve the encoding, but performs an encoding conversion to the
target's encoding ID.
Right.
As I now understand: Exception: Target encoding ID = 0, source encoding
ID <> 0. (very clever, but not easily understandable)

.-Michael
Michael Schnell
2011-10-12 12:20:50 UTC
Permalink
Post by Hans-Peter Diettrich
Right, the new string types are *strict* types,
That does make sense regarding Pascal's general "strict type" paradigm.

-Michael
Hans-Peter Diettrich
2011-10-11 19:37:27 UTC
Permalink
Post by Michael Schnell
Post by Hans-Peter Diettrich
Post by Michael Schnell
Why should a function Utf8ToLower be used (or even be defined for
normal use) ?
Because it expects and UTF8 argument, and provides an UTF8 result, so
that no further conversions are required when used with strings of
exactly that encoding.
I don't understand your argument. If ToLower gets a new string that is
UTF8 encoded, the result should be a new string that is UTF8 encoded. So
why bother. Checking the encoding Word and branching tho the appropriate
encoding-aware functionality is a very fast operation.
Why implement the upper/lower translation N times, when afterwards the N
encodings would have to be converted into the Result encoding anyway -
and the encoding conversions already exist?
Post by Michael Schnell
Post by Hans-Peter Diettrich
Post by Michael Schnell
With dynamically encoded Strings "ToLower" should work for any encoding.
You mean something like this?
function ToLower(s: RawByteString): RawByteString;
[dunno whether RawByteString is an allowed Result type at all]
In fact I still don't understand the difference between a type called
"RawByteString"and a basic new String that happens to be set to the
encoding "RawByte".
I only have the Delphi implementation at hand, and there every string
type has an associated encoding. E.g. the default AnsiString has
codepage 1252 encoding, and when a string is assigned to such a
variable, it is converted into codepage 1252. The *only* exception is
the RawByteString type, which can have any encoding.
Post by Michael Schnell
IMO, calling ToLower with a string that is set to the encoding "RawByte"
does not make sense and should generate an exception.
When a string is assigned to a RawByteString, both point to the
original string, which has a valid (non-raw) encoding. The only
exception is an empty string, with a Nil pointer and consequently no
stored encoding - but such a string never needs any conversion.

DoDi
Michael Schnell
2011-10-12 08:05:20 UTC
Permalink
Post by Hans-Peter Diettrich
Why implement the upper/lower translation N times, when afterwards the
N encodings have to be converted into the Result encoding? Where the
encoding conversions already exist...
Obviously, the dedicated upper/lower translation done in a certain
encoding is a lot faster than any re-encoding.

-Michael
Michael Schnell
2011-10-12 08:08:41 UTC
Permalink
Post by Michael Schnell
IMO, calling ToLower with a string that is set to the encoding
"RawByte" does not make sense and should generate an exception.
Nope.

A new string consists of a record that contains the encoding ID, element
size, reference count, length and the pointer to the content:

TAnsiRec = Packed Record
  CodePage    : TSystemCodePage;
  ElementSize : Word;
{$ifdef CPU64}
  { align fields }
  Dummy : DWord;
{$endif CPU64}
  Ref   : SizeInt;
  Len   : SizeInt;
  First : AnsiChar;
end;

So each string variable has its own dedicated encoding ID and can't
point to that of another string.

-Michael
Michael Schnell
2011-10-12 08:12:59 UTC
Permalink
The last answer was to
Post by Hans-Peter Diettrich
When a string is assigned to an RawByteString, both point to the
original string, which has a valid (non-raw) encoding.
-Michael
Hans-Peter Diettrich
2011-10-11 06:52:33 UTC
Permalink
Post by Martin
just for how to do
procedure foo(x: utf8string); begin end;
var a: string; //ansistring, but contains already utf8
The encoding will be stored or converted when a string is assigned to
that variable. When the FPC implementation is finished, it should be
impossible to have strings stored with a wrong encoding.
Post by Martin
foo(a); // do not convert
Why not?
Post by Martin
Post by Jonas Maebe
Post by Martin
And what happens if an app did read data from some external source
(serial port) and then wants to declare what encoding it is?
http://docwiki.embarcadero.com/VCL/en/System.SetCodePage
I hadn't seen that.
That may help. Though not the best solution...
It does *not* help, because SetCodePage does a string *conversion* when
it really changes the encoding. Delphi even allowed converting
between UTF-16 (CP 1200) and other (byte-oriented) encodings, but later
disallowed such in-place conversions again. Now a UTF-16 (Delphi
default) string is *always* converted when it's passed to a subroutine
expecting a RawByteString argument.
Post by Martin
I can call it before calling the "foo" proc. But I must revert it
afterwards, or at sometime later, the string will be translated, when it
will be used in a normal string again (yet expected to keep being utf8..
IMO the only chance of fixing a wrong encoding is a TBytes (or similar)
buffer: copy the string content into it (without translation), then
read it back specifying the correct encoding.
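
A sketch of that buffer round-trip (the helper name is made up; TBytes
comes from SysUtils; it assumes SetString tags the result with the
declared code page of the target type, as it does in Delphi):

procedure ReTagAsUtf8(const Src: AnsiString; out Dst: UTF8String);
var
  Buf: TBytes;
begin
  // copy the raw bytes out of Src, so no string conversion can touch them
  SetLength(Buf, Length(Src));
  if Length(Buf) > 0 then
    Move(Src[1], Buf[0], Length(Buf));
  // read them back into a variable whose declared type says UTF-8
  if Length(Buf) > 0 then
    SetString(Dst, PAnsiChar(@Buf[0]), Length(Buf))
  else
    Dst := '';
end;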
Post by Martin
Yes, I know, what i want to do, is not what it was designed for.
ultimately a huge update to the entire source will be needed... but now
I need a temporary solution until then
You don't need a temporary solution until the new strings are fully
implemented in FPC. Afterwards you only have to take care when reading
strings from *external* sources, where you have to specify the correct
external encoding - see e.g.
http://docwiki.embarcadero.com/VCL/en/Classes.TStrings.LoadFromStream
with its added Encoding argument.

When you want a variable to contain strings of a specific encoding, e.g.
UTF-8, you simply give it the appropriate type. I assume that a
UTF8String type will be declared like AnsiString<cpUTF8>, with
appropriate constants being declared for the standard codepages.

DoDi
Michael Schnell
2011-10-11 07:53:26 UTC
Permalink
Post by Hans-Peter Diettrich
It does *not* help, because SetCodePage does a string *conversion*,
when it really changes the encoding.
That of course does make sense. OTOH, there should (must) be a function
that forces the encoding to some setting without looking at the byte (or
Word or DWord) array content.

I suppose that, as a low-level programmer, you can do this in a similar
way to how you can access the reference counter.

-Michael
Michael Schnell
2011-10-11 10:52:48 UTC
Permalink
Post by Hans-Peter Diettrich
It does *not* help, because SetCodePage does a string *conversion*,
Nope.

procedure SetCodePage(var s : RawByteString; CodePage : TSystemCodePage;
Convert : Boolean = True);

So it can be set to do a conversion or not to do it.
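
Which is the knob Martin's original case needs - as a sketch (the hard
cast to RawByteString is the usual way to satisfy the var parameter, and
65001 is the UTF-8 code page):

var
  a: string;   // ansistring that already holds UTF-8 bytes, but carries the system code page tag
begin
  SetCodePage(RawByteString(a), 65001, False);  // re-tag only; the bytes stay untouched
  foo(a);   // foo(x: utf8string) from the first posts: the code pages now match,
            // so the run-time conversion leaves the data alone
end;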

-Michael
Hans-Peter Diettrich
2011-10-11 06:25:50 UTC
Permalink
Post by Martin
In my case that is in synedit. Even if I changed every string in
synedit, it would still be used from the IDE and many user apps, with
just "string". So then the text would be corrupted at that point.
Text corruption can occur only when the stored encoding does not match
the real encoding of the contained text (bytes). When the FPC
implementation is finished, such mismatches should be possible only with
strings read from files or other sources (OS/lib API), with wrong
encodings specified on the FPC side.
Post by Martin
The only way to do that, is if every single fpc/lazarus user changes at
the same time.
When every user can have a different encoding of generic strings, we'll
have to face bug reports which depend on the user locale!
Post by Martin
And what happens if an app did read data from some external source
(serial port) and then wants to declare what encoding it is?
That's not different from the current situation. Network communication
must agree about the encoding to use, and the user is responsible for
the correct encoding when local files are read or written. When files
are stored in UTF-8 (or UTF-16), a BOM will leave no room for wrong
guessing.

DoDi
Michael Schnell
2011-10-11 07:40:07 UTC
Permalink
Post by Martin
But that will always just push the issue to another location.
Somewhere the change from string to utf8string must be made.
??? The "new string" paradigm is all about changing from utf8string (and
other such types) to string. Isn't it ?

The big decision was to do either (a) multiple string types with
dedicated encoding or (b) a single string type with dynamic encoding.
While I don't dare to decide which way is the better one, the decision
to do (b) has now finally been taken. So - like it or not - we should
stick to it in all projects and do away with types like ANSIString,
utf8string etc.

-Michael
Hans-Peter Diettrich
2011-10-11 09:28:30 UTC
Permalink
Post by Michael Schnell
Post by Martin
But that will always just push the issue to another location.
Somewhere the change from string to utf8string must be made.
??? The "new string" paradigm is all about changing from utf8string (and
other such types) to string. Isn't it ?
No.
Post by Michael Schnell
The big decision was to do either (a) multiple string types with
dedicated encoding
That's really new: every string type now has a fixed encoding.

Post by Michael Schnell
or (b) a single string type with dynamic encoding.

That's a RawByteString.

Excessive use of RawByteString is at least as inefficient as using
Variant for everything :-(

DoDi
Michael Schnell
2011-10-11 08:58:08 UTC
Permalink
Post by Hans-Peter Diettrich
Excessive use of RawByteString is at least as inefficient as using
Variant for everything :-(
While I still doubt that RawByteString is the (only) type to be used for
dynamic encoding...

That is correct, depending on what "excessive" means in a practical case.

If the encoding is the same throughout the flow of the complete program,
only the check of the code ID is necessary, and this does not impose
considerable speed degradation. If you deal with multiple encodings at
the same time in a thoughtless way, a speed disaster is bound to happen.

-Michael
Luiz Americo Pereira Camara
2011-10-10 22:06:13 UTC
Permalink
Post by Jonas Maebe
1- Most of LCL must be code page agnostic, so not use UTF8String/AnsiString directly (keep String)
There is no difference between ansistring and string in {$mode delphi} and {$mode objfpc}
OK.
There's just one problem with using $mode to define the string behavior: say
you have a component written in {$mode delphi}. Then code written in
{$mode delphiunicode} uses that library.
Post by Jonas Maebe
. In a future delphiunicode mode or something like that string will be unicodestring
What about Marco's proposition of having separate versions of
RTL/Classes for UTF8 / UTF16? Or did I miss something?
Post by Jonas Maebe
, but that's not "code-page agnostic" either. The only somewhat code page agnostic string type is RawByteString.
I don't mean that the string type is unicode agnostic. I mean that the
code is unicode agnostic, i.e. it will work regardless of the code page.
Post by Jonas Maebe
2- It should have (dont know if currently has) a compiler switch to change the default code page to UTF8 or whatever, so all variables with type String will map to UTF8String.
I doubt that such a feature will be added. If you want that, declare your own string type with whatever default code page you want to use and use that type everywhere.
OK. In practice this will force Lazarus to go to UTF16, since renaming
all string types of the LCL from String to UTF8String is a no-no, at least
for me.

Luiz
Jonas Maebe
2011-10-10 22:18:06 UTC
Permalink
Post by Jonas Maebe
1- Most of LCL must be code page agnostic, so not use UTF8String/AnsiString directly (keep String)
There is no difference between ansistring and string in {$mode delphi} and {$mode objfpc}
OK.
There's just one problem using $mode to define the string behavior: say you have a component written in {$mode delphi}. Than code written in {$mode delphiunicode} uses that library.
That is no more of a problem than mixing code that uses string in {$h-} mode with code that uses string in {$h+} mode.
Post by Jonas Maebe
. In a future delphiunicode mode or something like that string will be unicodestring
What about the Marco proposition of having separated versions of RTL/Classes for UTF8 / UTF16? Or did i miss something?
That would not change the meaning of the "string" type. The code in rtl/classes would then use a custom string type (RTLString or whatever) that is defined as either an utf8string or a unicodestring based on some define.
Post by Jonas Maebe
, but that's not "code-page agnostic" either. The only somewhat code page agnostic string type is RawByteString.
I dont mean string type unicode agnostic. I mean code unicode agnostic, i.e., will work regardless of the code page
Generally, the only string types that will always work regardless of the used code pages are utf8string and unicodestring (and maybe some utf32string type, although afaik that's just a dynamic array type). Any other string type except for RawByteString will result in code page conversions that may be lossy (and RawByteString itself has its own share of gotchas to watch out for if you use it for anything other than parameters).
Post by Jonas Maebe
2- It should have (dont know if currently has) a compiler switch to change the default code page to UTF8 or whatever, so all variables with type String will map to UTF8String.
I doubt that such a feature will be added. If you want that, declare your own string type with whatever default code page you want to use and use that type everywhere.
Ok. This in practice will force Lazarus to go to UTF16 since renaming all string types of LCL from String to UTF8String is a no-no, at least for me.
I really don't see how adding a feature to the compiler to change the default definition of the string type would change anything. As I said, you can achieve exactly the same result by using a custom defined string type.


Jonas
Luiz Americo Pereira Camara
2011-10-11 01:12:53 UTC
Permalink
Post by Jonas Maebe
Post by Jonas Maebe
. In a future delphiunicode mode or something like that string will be unicodestring
What about the Marco proposition of having separated versions of RTL/Classes for UTF8 / UTF16? Or did i miss something?
That would not change the meaning of the "string" type. The code in rtl/classes would then use a custom string type (RTLString or whatever) that is defined as either an utf8string or a unicodestring based on some define.
So in summary, the unicode version of the fpc classes unit will always
have String = UnicodeString/UTF16.
Post by Jonas Maebe
Ok. This in practice will force Lazarus to go to UTF16 since renaming all string types of LCL from String to UTF8String is a no-no, at least for me.
I really don't see how adding a feature to the compiler to change the default definition of the string type would change anything.
I was assuming there would be a UTF8 version of the classes unit where
string = UTF8. So I think it would be possible to choose the UTF8 unicode
classes and force Lazarus to be compiled with String = UTF8 to match.
Post by Jonas Maebe
As I said, you can achieve exactly the same result by using a custom defined string type.
Yes, and as I said, it is a no-no (in my humble opinion) for Lazarus to
change from String to UTF8String or LazString. There are tons of code /
components in the Lazarus/LCL ecosystem that would need such a change. Also,
porting Delphi VCL components would be a lot harder. What about form
resource streaming? Change string types with search and replace?

But let's say Lazarus wants to stay with UTF8 and chooses to go with your
suggestion and change to a custom string type like UTF8String.

Under unix I would have LCL (UTF8) <> Classes (UTF16) <> RTL (UTF8).

It will be impossible to track / avoid string conversions.

Luiz
Hans-Peter Diettrich
2011-10-11 07:17:29 UTC
Permalink
Post by Luiz Americo Pereira Camara
Yes and as i said is a no-no (in my humble opinion) for Lazarus to
change from String to UTF8String or LazString. There are tons of code /
components under Lazarus/LCL ecosystem that would need such change. Also
porting Delphi VCL components would be a lot harder.
IMO Lazarus (and FPC) should follow the Delphi way, with strictly
separate Unicode and pre-Unicode versions. Nobody can expect that new
VCL (Unicode) components can be back-ported to Ansi versions.

As long as the LCL is not fully implemented for D7 compatibility, adding
post-D7 extensions is questionable, to say the least. It may be fun for the
core developers, but not for the users. Such attempts will end up in a
*third* model that is compatible with *neither* D7 nor any later Delphi
version :-(

DoDi
Michael Schnell
2011-10-11 08:02:34 UTC
Permalink
Post by Hans-Peter Diettrich
IMO Lazarus (and FPC) should follow the Delphi way, with strictly
separate Unicode and pre-Unicode versions. Nobody can expect that new
VCL (Unicode) components can be back-ported to Ansi versions.
Right now, with the current (Unicode-aware) LCL version (which forces
UTF-8 content into ANSIString-typed variables and with this creates some
problems that the pre-Unicode version did not show), this has already
been done.

So the move to "new strings" would be a move to _another_ kind of Unicode
awareness, one that - regarding legacy user code - could hopefully be
even more compatible with the long-gone pre-Unicode version.

-Michael
Luiz Americo Pereira Camara
2011-10-11 01:30:58 UTC
Permalink
Post by Jonas Maebe
What about the Marco proposition of having separated versions of RTL/Classes for UTF8 / UTF16? Or did i miss something?
That would not change the meaning of the "string" type. The code in rtl/classes would then use a custom string type (RTLString or whatever) that is defined as either an utf8string or a unicodestring based on some define.
A snippet from Marco's earlier mail on this list:

"

The constant pressure from the Lazarus team was the main rationale to come
up with two RTLs. Since the original unicode discussion on core (early 2009,
just before 2009 came out) came up with a type that was mostly UTF16 on
Windows.

People always whined about overloading as a solution, but that won't work
because of virtual methods with string parameters or return values in them.

Moreover even if Lazarus decided to migrate to that, it would be totally
broken for a long while till it caught up. The current route is friendlier.

In the UTF8 RTL, all "string"s _ARE_ utf8, unless specified otherwise (by
naming them unicodestring or ansistring(..encoding) or shortstrings).

So the same virtual method with a STRING parameter will be TUnicodestring
in the UTF16 rtl and UTF8string in the utf8 RTL.

"

Luiz
Hans-Peter Diettrich
2011-10-11 07:01:58 UTC
Permalink
Post by Jonas Maebe
Post by Jonas Maebe
Post by Luiz Americo Pereira Camara
1- Most of LCL must be code page agnostic, so not use
UTF8String/AnsiString directly (keep String)
There is no difference between ansistring and string in {$mode
delphi} and {$mode objfpc}
OK. There's just one problem using $mode to define the string
behavior: say you have a component written in {$mode delphi}. Than
code written in {$mode delphiunicode} uses that library.
That is no more a problem than using code using string in {$h-} mode
with code using string in {$h+} mode.
IMO {$h} should be dropped, since the compiler is the only application
that still uses ShortStrings. At least the default should be {$h+}
nowadays, and the compiler should warn or hint whenever a $h directive
is found in source code.

DoDi
Tomas Hajny
2011-10-11 07:17:34 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Jonas Maebe
Post by Jonas Maebe
Post by Luiz Americo Pereira Camara
1- Most of LCL must be code page agnostic, so not use
UTF8String/AnsiString directly (keep String)
There is no difference between ansistring and string in {$mode
delphi} and {$mode objfpc}
OK. There's just one problem using $mode to define the string
behavior: say you have a component written in {$mode delphi}. Than
code written in {$mode delphiunicode} uses that library.
That is no more a problem than using code using string in {$h-} mode
with code using string in {$h+} mode.
IMO {$h} should be dropped, since the compiler is the only application
that still uses ShortStrings. At least the default should be {$h+}
nowadays, and the compiler should warn or hint whenever a $h directive
is found in source code.
Why should it be dropped? There are use cases where a shortstring is
much more appropriate (especially for texts known to be short, to
avoid the memory allocation overhead). Why should users be warned
when using it (especially when using it explicitly)?

Tomas
Sven Barth
2011-10-11 07:56:39 UTC
Permalink
Post by Tomas Hajny
Post by Hans-Peter Diettrich
Post by Jonas Maebe
Post by Jonas Maebe
Post by Luiz Americo Pereira Camara
1- Most of LCL must be code page agnostic, so not use
UTF8String/AnsiString directly (keep String)
There is no difference between ansistring and string in {$mode
delphi} and {$mode objfpc}
OK. There's just one problem using $mode to define the string
behavior: say you have a component written in {$mode delphi}. Than
code written in {$mode delphiunicode} uses that library.
That is no more a problem than using code using string in {$h-} mode
with code using string in {$h+} mode.
IMO {$h} should be dropped, since the compiler is the only application
that still uses ShortStrings. At least the default should be {$h+}
nowadays, and the compiler should warn or hint whenever a $h directive
is found in source code.
Why should it be dropped? There are use cases when a shortstring is
so much more appropriate (especially for texts known to be short to
avoid the memory allocation overheads). Why users should be warned
when using it (especially when using it explicitly)?
He doesn't talk about dropping "ShortString" entirely, but about making "{$H+}"
the default for mode "ObjFPC", as was done for mode "Delphi".
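A minimal illustration of what the switch changes (a sketch; the comments reflect the documented {$H} behaviour):

{$H-}
var
  A: string;  // with {$H-}, "string" means ShortString (max. 255 characters)
{$H+}
var
  B: string;  // with {$H+}, "string" means AnsiString (heap-allocated, reference counted)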

Regards,
Sven
Hans-Peter Diettrich
2011-10-11 06:11:47 UTC
Permalink
Post by Jonas Maebe
Post by Luiz Americo Pereira Camara
1- Most of LCL must be code page agnostic, so not use
UTF8String/AnsiString directly (keep String)
There is no difference between ansistring and string in {$mode
delphi} and {$mode objfpc}.
You obviously missed that the new AnsiString type has an encoding, with
implicit conversions when strings of different codepages are passed to
subroutines or stored in variables. An AnsiString on one machine may
have a different encoding on a machine with a different user locale.
When a string contains UTF-8, its encoding must be set to UTF-8 as well,
otherwise implicit conversions will result in garbage - as observed by
the OP.
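For illustration, a sketch of how a string that already holds UTF-8 bytes could be tagged as such without converting the data, using SetCodePage from the codepage-aware trunk RTL (Convert=False only changes the stored codepage; the routine name is made up):

procedure HandOverAsUtf8(const Bytes: RawByteString);
var
  S: RawByteString;
  U: UTF8String;
begin
  S := Bytes;
  // Only re-tag the payload; Convert=False leaves the bytes untouched.
  SetCodePage(S, CP_UTF8, False);
  // Declared (UTF8String) and stored (CP_UTF8) encodings now match,
  // so this assignment should not trigger a lossy conversion.
  U := S;
  // ... hand U over to UTF-8 expecting code ...
end;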
Post by Jonas Maebe
In a future delphiunicode mode or
something like that string will be unicodestring, but that's not
"code-page agnostic" either. The only somewhat code page agnostic
string type is RawByteString.
RawByteString can be used only for pass-through strings in subroutines
that have string arguments but do not manipulate these arguments
themselves. We could start to find out all subroutines and methods of
that kind...

All "const" and "var" parameters may also deserve special
consideration. When the encoding of a "const" string cannot be
changed as required, local copies must be used. I'm not sure about the
implementation of "var" parameters right now, because a user may not
be happy when a called subroutine changes the encoding of his string
variable. More probably, implicit conversions will be inserted before
and after the subroutine call, resulting in very inefficient code.
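For what it's worth, a sketch of the kind of pass-through routine where RawByteString fits: the parameter accepts a string in any 1-byte encoding without forcing a conversion at the call site, and the body only inspects bytes instead of re-encoding them:

function BytesUpToFirstLineBreak(const S: RawByteString): SizeInt;
var
  I: SizeInt;
begin
  // Works on raw bytes only, so the result is a byte count, not a
  // character count, whatever the encoding of S happens to be.
  for I := 1 to Length(S) do
    if S[I] in [#10, #13] then
      Exit(I - 1);
  Result := Length(S);
end;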


For all these reasons Delphi has chosen UTF-16 for the new generic
string type, where encoding conversions are *really* required only when
explicit AnsiStrings are used, e.g. in records or legacy code. IMO
FPC and Lazarus should take the same step, sooner or later. The few
situations where an OS or library (widgetset...) API requires UTF-8
encoded strings should have no noticeable runtime impact.

DoDi
Sven Barth
2011-10-11 07:59:07 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Jonas Maebe
Post by Luiz Americo Pereira Camara
1- Most of LCL must be code page agnostic, so not use
UTF8String/AnsiString directly (keep String)
There is no difference between ansistring and string in {$mode
delphi} and {$mode objfpc}.
You obviously missed that the new AnsiString type has an encoding, with
implicit conversions when strings of different codepages are passed to
subroutines or stored in variables. An AnsiString on one machine may
have a different encoding on a machine with a different user locale.
When a string contains UTF-8, its encoding must be set to UTF-8 as well,
otherwise implicit conversions will result in garbage - as observed by
the OP.
Nevertheless Jonas' statement is correct, because (currently) String =
AnsiString and thus they are the same (and both can currently use code
pages).

Regards,
Sven
Hans-Peter Diettrich
2011-10-11 09:35:08 UTC
Permalink
Post by Sven Barth
Nevertheless Jonas' statement is correct, because (currently) String =
AnsiString and thus they are the same (and both can currently use code
pages).
Really? That would be absolutely incompatible with Delphi!

And it would be as inefficient as using Variant for everything :-(

DoDi
Sven Barth
2011-10-11 08:43:55 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Sven Barth
Nevertheless Jonas' statement is correct, because (currently) String =
AnsiString and thus they are the same (and both can currently use code
pages).
Really? That would be absolutely incompatible with Delphi!
This is because - as you might have noticed - the current code is still
far from a good working condition. So it would be INSANE to change the
default string type as well in the same go. Also, Free Pascal takes its
legacy heritage much more seriously than Delphi. Thus I doubt (personal
opinion/observation) that the modes Delphi and ObjFPC will change their
default string type to UnicodeString at all. What I can imagine, though,
is that a new mode DelphiUnicode and a new modeswitch UnicodeStrings
will be introduced which do exactly what Delphi 2009 has done: change
the default string to UnicodeString. But unlike Delphi, this will be
possible on a per-unit basis. (That the RTL must be ready for something
like this is a different topic.)

Regards,
Sven
Hans-Peter Diettrich
2011-10-11 21:39:29 UTC
Permalink
Post by Sven Barth
Post by Hans-Peter Diettrich
Post by Sven Barth
Nevertheless Jonas' statement is correct, because (currently) String =
AnsiString and thus they are the same (and both can currently use code
pages).
Really? That would be absolutely incompatible with Delphi!
This is because - as you might have noticed - the current code is still
far from a good working condition. So it would be INSANE to change the
default string type as well in the same go.
+1

Thanks for the clarification :-)
Post by Sven Barth
Also Free Pascal takes its
legacy heritage much more serious than Delphi. Thus I doubt (personal
opinion/observation) that the modes Delphi and ObjFPC will change their
default string type to UnicodeString at all.
Okay, I'll stop asking boring questions about the future strings in FPC,
until the opinions have settled down.

DoDi
Hans-Peter Diettrich
2011-10-11 05:33:19 UTC
Permalink
Post by Luiz Americo Pereira Camara
1- Most of LCL must be code page agnostic, so not use
UTF8String/AnsiString directly (keep String)
I'd use another type, e.g. LCLstring, which can be set independently
of any other automatisms.
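A sketch of such an independent type, using the declaration syntax for codepage-aware AnsiStrings in trunk (the name LCLstring and the choice of CP_UTF8 are just for illustration):

type
  // A distinct string type whose data is always tagged as UTF-8,
  // regardless of what the default "string" happens to map to.
  LCLstring = type AnsiString(CP_UTF8);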
Post by Luiz Americo Pereira Camara
2- It should have (dont know if currently has) a compiler switch to
change the default code page to UTF8 or whatever, so all variables with
type String will map to UTF8String.
What if a user has a different opinion, and changes the type to his
local codepage, or to UTF-16?

A boundary could be established, where strings are encoded as the *user*
specifies (i.e. generic "string"), and where the *LCL* requires a
specific (implemented) encoding.
Post by Luiz Americo Pereira Camara
3- The UTF8String/AnsiString type should be reserved where strictly
necessary like libraries that require UTF8 or RTL interfacing
Concrete types are required when strings are manipulated (parsed...),
and the implementation assumes a certain encoding. This should not
happen often in the LCL, but will be vital for the IDE (CodeTools...).

Properties like SelLength deserve consideration when they currently
mean the number of logical (UTF-8) characters.

DoDi
Michael Schnell
2011-10-11 08:35:07 UTC
Permalink
Post by Hans-Peter Diettrich
I'd use another type, e.g. LCLstring, which can be set independently
from any other automatisms.
While using a "private" string type "just in case" or for flexibility in
a work in progress, might be a good idea, the goal should be to
everywhere use the dynamically encoded basic new string type and have
the private type be an alias to same.

-Michael
Michael Schnell
2011-10-11 08:42:01 UTC
Permalink
Post by Hans-Peter Diettrich
Concrete types are required when strings are manipulated (parsed...),
and the implementation assumes a certain encoding.
Why do you think so? When parsing, a 32-bit Unicode character can be
extracted from a new string with any (non-raw) encoding. When
manipulating partial strings, the resulting assignments get the encoding
of the original string; when comparing strings, conversion is done
automatically.

For speed, the user might want to structure his code appropriately (e.g. by
anticipating conversions).
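To illustrate the "extract a code point" part, a hand-rolled sketch that decodes the first code point of a UTF-8 tagged string (no library calls assumed; well-formed input only, no error handling):

function FirstCodePoint(const S: UTF8String): Cardinal;
var
  B: Byte;
begin
  Result := 0;
  if S = '' then
    Exit;
  B := Ord(S[1]);
  if B < $80 then                   // 1-byte sequence (ASCII)
    Result := B
  else if (B and $E0) = $C0 then    // 2-byte sequence
    Result := ((B and $1F) shl 6) or (Ord(S[2]) and $3F)
  else if (B and $F0) = $E0 then    // 3-byte sequence
    Result := ((B and $0F) shl 12) or ((Ord(S[2]) and $3F) shl 6) or
              (Ord(S[3]) and $3F)
  else                              // 4-byte sequence
    Result := ((B and $07) shl 18) or ((Ord(S[2]) and $3F) shl 12) or
              ((Ord(S[3]) and $3F) shl 6) or (Ord(S[4]) and $3F);
end;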

-Michael
Michael Schnell
2011-10-11 07:34:10 UTC
Permalink
Post by Luiz Americo Pereira Camara
2- It should have (dont know if currently has) a compiler switch to
change the default code page to UTF8 or whatever, so all variables
with type String will map to UTF8String.
Why?

I feel the LCL code should only be codepage-aware directly at the OS /
widgetset interface, which might or might not be UTF-8. I don't see why
the LCL should force anything else into a certain encoding (such as UTF-8).

-Michael
Marco van de Voort
2011-10-11 02:00:21 UTC
Permalink
Post by Jonas Maebe
That would not change the meaning of the "string" type. The code in
rtl/classes would then use a custom string type (RTLString or whatever)
that is defined as either an utf8string or a unicodestring based on some
define.
I did plan to make the string type change, since anything else would make
Classes incompatible with Delphi code (be it old or new) that uses string.

I know that Florian and you wanted to see the default string as something of a
dialect mode, but I never saw a way to do that practically.
Post by Jonas Maebe
RawByteString will result in code page
conversions that may be lossy (and RawByteString itself has its own share
of gotchas to watch out for if you use it for anything else than
parameters).
Important enough to quote again: view RawByteString as something like a
special open array. Useful for its specific purpose, but not a main type.
Post by Jonas Maebe
I really don't see how adding a feature to the compiler to change the
default definition of the string type would change anything. As I said,
you can achieve exactly the same result by using a custom defined string
type.
It would make it match the bulk of code out there. That is the main
point I'm trying to make with the multi-RTL proposal. You can't see
the base type as something you simply toggle with {$H} anymore.

With the shortstring -> ansistring conversion we changed the RTL (from the
TP-oriented one to the Delphi-oriented one, and over time we e.g. exchanged
Dos for SysUtils). And effectively shortstring (while still a dialect mode) is
hardly used (and IMHO we should have changed the default mode 5 years ago).

But now we keep using SysUtils and the rest of the RTL. If there is
anything I want to avoid, it is telling people 15 times a day how they must
change their Delphi code to suit FPC.

Such a situation is effectively the end of compatibility, the end of sold
components for Lazarus (which works by virtue of the minimal changes
needed).
Hans-Peter Diettrich
2011-10-11 07:21:35 UTC
Permalink
Post by Marco van de Voort
But now we keep using SysUtils and the rest of the RTL. If there is
anything I want to avoid, it is telling people 15 times a day how they must
change their Delphi code to suit FPC.
Such a situation is effectively the end of compatibility, the end of sold
components for Lazarus (which works by virtue of the minimal changes
needed).
+1

DoDi
Jonas Maebe
2011-10-11 20:34:02 UTC
Permalink
Post by Marco van de Voort
I know that Florian and you wanted to see the default string as something of a
dialect mode, but I never saw a way to do that practically.
How about this: a new language feature is added to the compiler that enables defining a type alias that resolves to a different type depending on whether {$modeswitch unicodestrings} is active in the current code. If necessary, it could also be extended for functions/procedures (but I'd like to use it as sparingly as possible).

E.g. (with the first statement obviously in need of being replaced with something more legible)

type
// needs to be defined in advance with this particular syntax so that the
// compiler will write "tstringlist" as the type name in the RTTI of the
// two classes below;
// the compiler will still generate separate RTTI for both classes though
tstringlist = FpcStringModeDifferentiatedType(tansistringlist,tunicodestringlist);

tansistringlist = class
<ansi stringlist code>
end;

tunicodestringlist = class
<unicode stringlist code>
end;

(obviously, you could also implement both types using generics, include files and macro substitution, etc). An alternative could be to extend the syntax for generics or specializations and incorporate such functionality there, but that is only practical if in all cases the two different classes can be expressed using a single generic implementation.

Such a feature would enable duplicating functionality where absolutely necessary for compatibility reasons (e.g., inside the classes unit) without adding the complexity of having two completely separate RTLs. And the result should be completely compatible with both code written for an ansistring-based and for a unicodestring-based RTL. In fact, if I'm not missing anything it would also make combining components depending on an ansistring-based and on a unicodestring-based RTL into a single program possible (at the expense of including two copies of some data structures, of course, but nobody forces you to do this; it's simply an option that's available).


Jonas
Marco van de Voort
2011-10-11 09:00:18 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Martin
And what happens if an app did read data from some external source
(serial port) and then wants to declare what encoding it is?
That's no different from the current situation. The communicating parties
must agree on the encoding to use, and the user is responsible for
the correct encoding when local files are read or written. When files
are stored in UTF-8 (or UTF-16), a BOM will leave no room for wrong
guessing.
Note that many places that are runtime-typed (like TStringList.LoadFromFile)
get an encoding parameter, so that the loading code can convert the encoding
of the file (given in the encoding parameter) to whatever string type
TStringList uses (typically UTF-8, UTF-16 or "default").

In short, see it as if text now has a mandatory encoding attached. If the
runtime doesn't know the type, then it is not text but binary, and you
should treat it as such.
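For reference, a sketch of how such a runtime-typed load looks with the Delphi-style TEncoding overload of TStrings.LoadFromFile (available in Delphi 2009+; whether and when FPC's Classes gets the same overload is exactly what is being discussed here):

program LoadDemo;

uses
  Classes, SysUtils;

var
  sl: TStringList;
begin
  sl := TStringList.Create;
  try
    // State what the bytes on disk are; the list converts them to
    // whatever its native string type is while loading.
    sl.LoadFromFile('data.txt', TEncoding.UTF8);
  finally
    sl.Free;
  end;
end.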
Hans-Peter Diettrich
2011-10-11 21:09:32 UTC
Permalink
Post by Marco van de Voort
Note that many places that are runtime typed (like tstringlist.loadfromfile)
get a encoding parameter, so that the loading code can convert the encoding
of the file (in encoding parameter) to whatever stringtype tstringlist uses
(typically utf8 or utf16 or "default")
Do you refer to the Delphi or the FPC implementation? I couldn't yet find an
encoding in FPC's TStrings or TStringList.
Post by Marco van de Voort
In short, see it as if text now has a mandatory encoding attached. If the
runtime doesn't know the type, then it is not text, but binary, and you
should treat it as such.
Can you give an example of how the runtime can not know the type of a string?

DoDi
Sven Barth
2011-10-12 07:58:19 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Marco van de Voort
Note that many places that are runtime typed (like
tstringlist.loadfromfile)
get a encoding parameter, so that the loading code can convert the encoding
of the file (in encoding parameter) to whatever stringtype tstringlist uses
(typically utf8 or utf16 or "default")
Do you refer to the Delphi or FPC implementation? I couldn't find yet an
encoding in FPC TStrings or TStringList.
As our codepage-aware string implementation is only at the beginning, and
thus no changes to Classes etc. have been done yet, I'd say that he's
speaking about the Delphi implementation (though FPC might follow here
later on).
Post by Hans-Peter Diettrich
Post by Marco van de Voort
In short, see it as if text now has a mandatory encoding attached. If the
runtime doesn't know the type, then it is not text, but binary, and you
should treat it as such.
Can you give an example of how the runtime can not know the type of a string?
If you just opened a file (no reading done yet) you don't know the
encoding. Or if the file has no BOM then you might want to guess the
encoding based on the content read. As long as you haven't guessed the
encoding (no matter whether the guess is right or wrong) you don't know
the encoding.

Regards,
Sven
Hans-Peter Diettrich
2011-10-12 10:37:29 UTC
Permalink
Post by Sven Barth
Post by Hans-Peter Diettrich
Post by Marco van de Voort
In short, see it as if text now has a mandatory encoding attached. If the
runtime doesn't know the type, then it is not text, but binary, and you
should treat it as such.
Can you give an example of how the runtime can not know the type of a string?
If you just opened a file (no reading done yet) you don't know the
encoding.
Right, but it's still a file, not a string.
Post by Sven Barth
Or if the file has no BOM then you might want to guess the
encoding based on the content read. As long as you haven't guessed the
encoding (no matter whether the guess is right or wrong) you don't know
the encoding.
Right again. When no encoding is specified, Delphi reads (part of) the file
into a byte array and looks for a BOM (TEncoding.GetBufferEncoding). If
none is found, the preferred encoding (of the TStrings/TStream instance)
is used, or the system encoding as a last resort.
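A sketch of that detection step, assuming the Delphi-style TEncoding API (GetBufferEncoding returns the BOM length and, when the passed encoding is nil, sets it to the detected encoding or to TEncoding.Default; needs Classes and SysUtils in the uses clause):

function GuessFileEncoding(const FileName: string): TEncoding;
var
  fs: TFileStream;
  buf: TBytes;
  n: Integer;
begin
  Result := nil;  // nil = "please detect"; GetBufferEncoding fills it in
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(buf, 4);                    // the longest BOM is 4 bytes
    n := fs.Read(buf[0], Length(buf));
    SetLength(buf, n);
    TEncoding.GetBufferEncoding(buf, Result);
  finally
    fs.Free;
  end;
end;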

DoDi
Marco van de Voort
2011-10-11 09:13:06 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Luiz Americo Pereira Camara
components under Lazarus/LCL ecosystem that would need such change. Also
porting Delphi VCL components would be a lot harder.
IMO Lazarus (and FPC) should follow the Delphi way, with strictly
separate Unicode and pre-Unicode versions. Nobody can expect that new
VCL (Unicode) components can be back-ported to Ansi versions.
Please explain what you mean by "unicode" and what by "ansi" in your
statement. Without nuancing that, your statement is pretty much
meaningless.

And keep in mind that while porting Delphi code is an important factor for
FPC, it is not the only consideration. Many people use FPC for server-side or
middle-tier development.

That is why the Delphi model of UTF-16 strings is not really constructive
on *nix.
Hans-Peter Diettrich
2011-10-11 21:20:11 UTC
Permalink
Post by Marco van de Voort
Post by Hans-Peter Diettrich
Post by Luiz Americo Pereira Camara
components under Lazarus/LCL ecosystem that would need such change. Also
porting Delphi VCL components would be a lot harder.
IMO Lazarus (and FPC) should follow the Delphi way, with strictly
separate Unicode and pre-Unicode versions. Nobody can expect that new
VCL (Unicode) components can be back-ported to Ansi versions.
Please explain what you mean by "unicode" and what by "ansi" in your
statement. Without nuancing that, your statement is pretty much
meaningless.
AFAIR Delphi changed the string type to Unicode (UTF-16) in D2009, i.e.
D2007 was the last Ansi version.
Post by Marco van de Voort
And keep in mind that while porting Delphi code is an important factor in
FPC, but not the only reason. Many people use FPC for serverside or middle
tier development.
It is that why the Delphi model of utf16 strings is not really constructive
on *nix.
I understand that it is annoying if the generic string encoding is ever
changed to UTF-16. What are the plans for FPC?

DoDi
Marco van de Voort
2011-10-11 09:21:44 UTC
Permalink
Post by Sven Barth
is that a new mode DelphiUnicode and a new modeswitch Unicodestrings
will be introduced which do exactly what Delphi 2009 has done: change
the default string to UnicodeString. But unlike Delphi this will be
possible on a per unit base. (That the RTL must be ready for something
like this is a different topic)
Not entirely. As said, the default string and the RTL are linked, and linked
in both directions. Forcing the default string in dialect modes makes it
impossible to compile the same piece of code in two different encodings using
two RTLs.

And everything that inherits from the base classes and uses "string" in a
virtual method (or maybe even everything protected/public) falls into that
category.

Stuffing it into dialect modes is a habit because it worked for the
shortstring -> ansistring transition, but at that time we also essentially
changed the RTL from TP-like to Delphi-like, and never really had to write
significant code that worked in both TP-like and Delphi-like modes.
Sven Barth
2011-10-11 11:54:58 UTC
Permalink
Post by Marco van de Voort
Post by Sven Barth
is that a new mode DelphiUnicode and a new modeswitch Unicodestrings
will be introduced which do exactly what Delphi 2009 has done: change
the default string to UnicodeString. But unlike Delphi this will be
possible on a per unit base. (That the RTL must be ready for something
like this is a different topic)
Not entirely. As said default string and rtl are linked, and linked in both
ways. Forcing default string in dialect modes, makes it impossible to
compile the same piece of code in two different encodings using two rtls.
And that touches everything that inherits from the base classes that uses
"string" in a virtual method (or maybe everything protected/public even)
falls in that category.
Stuffing it in dialect modes is a habit because it worked with the
shortstring->ansistring transition, but in that time we also essentially
changed RTL from TP-like to delphi-like, and never really had to write
significant code that worked in both TP as delphi-like modes.
Right... I guess I lost the overview of the problematic parts a bit...

Regards,
Sven
Marco van de Voort
2011-10-12 07:50:33 UTC
Permalink
Post by Hans-Peter Diettrich
Post by Marco van de Voort
Please explain what you mean by "unicode" and what by "ansi" in your
statement. Without nuancing that, your statement is pretty much meaning
less.
AFAIR Delphi changed the string type to Unicode (UTF-16) in D2009, i.e.
D2007 was the last Ansi version.
The point I was trying to make is that in D2009+ AnsiString includes UTF-8,
which is also Unicode. Therefore the term "unicode" is ambiguous. If you mean
the two-byte type, say UTF-16 or "2-byte type".
Post by Hans-Peter Diettrich
Post by Marco van de Voort
And keep in mind that while porting Delphi code is an important factor in
FPC, but not the only reason. Many people use FPC for serverside or middle
tier development.
It is that why the Delphi model of utf16 strings is not really constructive
on *nix.
I understand that is annoying when the generic string encoding is ever
changed into UTF-16. What are the plans with FPC?
Undecided. But I'm very strongly against utf16 default on unix. I don't do
much GUI on unix, and it would be insane to have a string type that is
totally different from all other string types that I touch.
Martin Schreiber
2011-10-12 08:59:54 UTC
Permalink
Post by Marco van de Voort
Undecided. But I'm very strongly against utf16 default on unix. I don't do
much GUI on unix, and it would be insane to have a string type that is
totally different from all other string types that I touch.
Do I understand it right that the constants, variables and properties in
classes.pas and db.pas which currently have the type "string" will be utf-8 on
Unix and utf-16 on Windows?

Martin
Sven Barth
2011-10-12 09:13:45 UTC
Permalink
Post by Martin Schreiber
Post by Marco van de Voort
Undecided. But I'm very strongly against utf16 default on unix. I don't do
much GUI on unix, and it would be insane to have a string type that is
totally different from all other string types that I touch.
Do I understand it right that the constants, variables and properties in
classes.pas and db.pas which currently have the type "string" will be utf-8 on
Unix and utf-16 on Windows?
There is no final decision on this topic, so it's hard to say what is
and what is not (the topic is not easy and we aren't far enough to even
have a fully working implementation of the dynamic string type).

Regards,
Sven
Martin Schreiber
2011-10-12 09:47:55 UTC
Permalink
Post by Sven Barth
Post by Martin Schreiber
Post by Marco van de Voort
Undecided. But I'm very strongly against utf16 default on unix. I don't
do much GUI on unix, and it would be insane to have a string type that
is totally different from all other string types that I touch.
Do I understand it right that the constants, variables and properties in
classes.pas and db.pas which currently have the type "string" will be
utf-8 on Unix and utf-16 on Windows?
There is no final decision on this topic, so it's hard to say what is
and what is not (the topic is not easy and we aren't far enough to even
have a fully working implementation of the dynamic string type).
I'd like to repeat my statement: for me as the author of MSEide+MSEgui the
current situation is ideal: string = AnsiString = 8-bit system encoding,
UnicodeString = UTF-16.
What I had to do was to implement UnicodeString file- and other system utils
and UnicodeString data lists and the like. All this is already available in
MSEgui.
If the char size of "string" is different on Unix and Windows, I am most
likely forced to fork db.pas and classes.pas, which maybe is even impossible
because of the necessary compiler magic. Hmm, what about RTTI? That most
likely prohibits forking. Do we really need UTF-16 in all strings here?
A side note: I don't think that following Delphi blindly is always a good
idea. Have a look at Firemonkey and you know what I mean. ;-)

Martin
Graeme Geldenhuys
2011-10-12 12:17:57 UTC
Permalink
Post by Martin Schreiber
idea. Have a look at Firemonkey and you know what I mean. ;-)
For those unfamiliar with Firemonkey, would you mind explaining further?


...but overall, I do agree with your statement that FPC shouldn't
follow Delphi blindly. Delphi and the VCL are Windows-centric - their whole
design doesn't fit other platforms. CLX (and I guess Firemonkey) was/is
different from the VCL for a reason. Cross-platform support needs
more thought, e.g. UTF-8 as the native string type under *nix systems, and
UTF-16 under Windows. Why must some platforms get a speed penalty and
others not, when you force a single encoding on all platforms?


As for your statement "do we really need Unicode support everywhere?":
well, with Delphi 2009's Unicode support, the Delphi language now
supports Unicode too. Thus unit names, class names, property names,
variable names etc. can all contain Unicode text in their names. So yes,
Unicode is required throughout the Object Pascal language and the FPC
compiler. You can't have AnsiString only in some places and Unicode
support in others. It's all or nothing.


Regards,
- Graeme -
--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/
Sven Barth
2011-10-12 12:25:01 UTC
Permalink
Post by Graeme Geldenhuys
As for you statement regarding "do we need Unicode support everywhere?"
Well, with Delphi 2009's Unicode support, the Delphi language now
supports Unicode too. Thus unit names, class names, property names,
variable names etc can all contain Unicode text in there names. So yes,
Unicode is required throughout the Object Pascal language, and FPC
Compiler. You can't have AnsiString only in some places, and Unicode
support in others. It's all or nothing.
You are not completely correct. Support for - let's call them - non-ASCII
identifiers was introduced in Delphi 2007, the version before the
Unicode switch. So those two concepts don't need to be used together.

I also don't agree with the point that it needs to be all or nothing.
Allowing Unicode in strings and string constants is a completely
different thing from allowing Unicode characters in identifiers as well,
and the first does not require the second.

Regards,
Sven
Jonas Maebe
2011-10-12 12:32:38 UTC
Permalink
Post by Graeme Geldenhuys
eg: UTF-8 as native string type under *nix systems, and
UTF-16 under Windows. Why must some platforms get a speed penalty and
others not, when you force only one encoding on all platforms?
The reason for doing so would be to make code more easily portable.
Many frameworks use UTF-16 everywhere, from MSE to WxWidgets to Qt to
Java to Mac OS X's system frameworks (even though at the unix/posix
interface level, Mac OS X is also UTF-8). That does not mean we have
to do the same, but neither is such a choice by definition guided by
being Windows-centric.

The main issue with the RTL is however, as far as I am concerned, not
that on some platforms an extra string conversion may be required here
or there, but compatibility with code written for D2009 and later, and
with code written for earlier Delphi/FPC versions.


Jonas
Martin Schreiber
2011-10-12 17:19:01 UTC
Permalink
Post by Jonas Maebe
Post by Graeme Geldenhuys
eg: UTF-8 as native string type under *nix systems, and
UTF-16 under Windows. Why must some platforms get a speed penalty and
others not, when you force only one encoding on all platforms?
The reason for doing so would be to make code more easily portable.
Many frameworks use UTF-16 everywhere, from MSE to WxWidgets to Qt to
Java to Mac OS X' system frameworks (even though at the unix/posix
interface level, Mac OS X is also utf-8). That does not mean we have
to do the same, but neither is such a choice per definition guided by
being Windows-centric.
The main issue with the RTL is however, as far as I am concerned, not
that on some platforms an extra string conversion may be required here
or there, but compatibility with code written for D2009 and later, and
with code written for earlier Delphi/FPC versions.
Interesting thread:
https://forums.codegear.com/thread.jspa?threadID=61763&tstart=0#399861

Martin
Martin Schreiber
2011-10-12 12:38:08 UTC
Permalink
Post by Graeme Geldenhuys
For those unfamiliar with Firemonkey, would you mind explaining further.
Read here for example:
https://forums.embarcadero.com/forum.jspa?forumID=380
Post by Graeme Geldenhuys
As for you statement regarding "do we need Unicode support everywhere?"
Well, with Delphi 2009's Unicode support, the Delphi language now
supports Unicode too. Thus unit names, class names, property names,
variable names etc can all contain Unicode text in there names. So yes,
Unicode is required throughout the Object Pascal language, and FPC
Compiler.
Is this desirable? What is the benefit of non-ASCII Pascal identifiers at the
expense of performance and simplicity?

Martin
Graeme Geldenhuys
2011-10-12 13:07:22 UTC
Permalink
Thanks for the link.
Post by Martin Schreiber
Is this desirable? What is the benefit of non ASCII Pascal identifiers at the
expense of performance and simplicity?
No idea if it is desirable - probably not for a global open
source project. But who am I to say that a Russian programmer may not
use his native language inside his Object Pascal code? It should be his
choice, not mine.



Regards,
- Graeme -
--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/
Hans-Peter Diettrich
2011-10-12 11:38:09 UTC
Permalink
Post by Marco van de Voort
Post by Hans-Peter Diettrich
Post by Marco van de Voort
Please explain what you mean by "unicode" and what by "ansi" in your
statement. Without nuancing that, your statement is pretty much
meaningless.
AFAIR Delphi changed the string type to Unicode (UTF-16) in D2009, i.e.
D2007 was the last Ansi version.
Point I was trying to make is that in D2009+ ansistring includes utf8 which
is also unicode. Therefore the term "unicode" is ambiguous. If you mean
the two byte type say utf16 or 2-byte type.
Point taken.

UTF-16 strings already existed in the old (Ansi) versions, as
WideString. The new string = UnicodeString is a reference-counted type,
and at the same time AnsiString was extended with a stored encoding and
element size. The StrRec record, prefixed to the string data, is the
same for AnsiString and UnicodeString, and different from the old AnsiString
header. All the new string types are strict types, not aliases. I mean all
these subtle differences when talking about Ansi and Unicode *versions* of
Delphi.
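To make the "header" part concrete: roughly (field names and exact sizes differ between Delphi versions and FPC, so treat this as an illustration rather than the real declaration), the record stored immediately before the character data looks like this:

type
  TIllustrativeStrRec = packed record
    CodePage: Word;      // encoding of the payload (e.g. UTF-8, 1252, UTF-16)
    ElementSize: Word;   // 1 for AnsiString, 2 for UnicodeString
    RefCount: Integer;   // reference count (-1 for string constants)
    Length: Integer;     // length in elements, not in bytes
  end;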

DoDi
Marco van de Voort
2011-10-12 07:56:36 UTC
Permalink
Post by Jonas Maebe
Post by Marco van de Voort
I know that Florian and you wanted to see the default string as something of a
dialect mode, but I never saw a way to do that practically.
How about this: a new language feature is added to the compiler that
enables defining a type alias that resolves to a different type depending
on whether {$modeswitch unicodestrings} is active in the current code. If
necessary, it could also be extended for functions/procedures (but I'd
like to use it as sparingly as possible).
E.g. (with the first statement obviously in need of being replaced with something more legible)
type
// needs to be defined in advance with this particular syntax so that the
// compiler will write "tstringlist" as the type name in the RTTI of the
// two classes below;
// the compiler will still generate separate RTTI for both classes though
tstringlist = FpcStringModeDifferentiatedType(tansistringlist,tunicodestringlist);
tansistringlist = class
<ansi stringlist code>
end;
tunicodestringlist = class
<unicode stringlist code>
end;
(obviously, you could also implement both types using generics, include
files and macro substitution, etc). An alternative could be to extend the
syntax for generics or specializations and incorporate such functionality
there, but that is only practical if in all cases the two different
classes can be expressed using a single generic implementation.
Such a feature would enable duplicating functionality where absolutely
necessary for compatibility reasons (e.g., inside the classes unit)
without adding the complexity of having two completely separate RTLs. And
the result should be completely compatible with both code written for an
ansistring-based and for a unicodestring-based RTL. In fact, if I'm not
missing anything it would also make combining components depending on an
ansistring-based and on a unicodestring-based RTL into a single program
possible (at the expense of including two copies of some data structures,
of course, but nobody forces you to do this; it's simply an option that's
available).
If it was just one class it would work. But essentially it is all OOP (e.g.
TComponent and TControl have string properties, and thus the whole of
Lazarus is affected), and the same goes for the OOP parts of packages/. It
would also mean rewriting Delphi code to use such schemes in order to stay
encoding-agnostic.

So sorry, but I don't see anything usable in this proposal. If I understand
it right, it is a workaround that allows switching a handful of classes in
FPC that way, at considerable cost (duplication), but disallows the user from
keeping his code encoding-agnostic.
Jonas Maebe
2011-10-12 08:13:51 UTC
Permalink
Post by Marco van de Voort
If it was just one class it would work. But essentially it is all OOP. (e.g.
tcomponent and tcontrol has string properties, and thus the whole of
lazarus),
Lazarus doesn't have to change anything. They are free to follow the
path you proposed for FPC: ship two completely separate LCLs (one
compiled with string = unicodestring and one compiled with string =
ansistring).
Post by Marco van de Voort
same for the OOP parts of packages/ It would also mean rewriting
delphi code using such schemes to be encoding agnostic to follow this.
If a class in the RTL or packages is by nature already encoding-
agnostic, the rewriting would consist of this:

type
tcomponent =
FpcStringModeDifferentiatedType(tansicomponent,tunicodecomponent);

generic tgenericcomponent<T> = class
..
end;

tansicomponent = specialize tgenericcomponent<ansistring>;
tunicodecomponent = specialize tgenericcomponent<unicodestring>;

(or use the Delphi variant of the generic syntax). That would indeed
require some ifdefs to keep the code compilable also by Delphi. No
solution will be completely free.


Jonas
Jonas Maebe
2011-10-12 08:23:34 UTC
Permalink
That would indeed require some ifdefs to keep the code compilable
also by Delphi. No solution will be completely free.
Well, an alternative could be to add a global directive such as

{$modeswitch duplicate_all_string_based_code}

whereby anything in that unit is first parsed as if the "string" types
were a generic type parameter, and then the compiler automatically
generates specializations for unicodestring and ansistring variants
of all declarations/implementations involving at least one string-typed
entity.


Jonas