Re: Ada and Unicode

<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8141&group=comp.lang.ada#8141

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!cleanfeed3-b.proxad.net!nnrp1-2.free.fr!not-for-mail
From: fantome.forums.tDeContes@free.fr.invalid (Thomas)
Newsgroups: comp.lang.ada
Mail-Copies-To: nobody
Subject: Re: Ada and Unicode
References: <607b5b20$0$27442$426a74cc@news.free.fr> <86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org> <s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org> <s5k0ai$bb5$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
User-Agent: MT-NewsWatcher/3.5.3b3 (Intel Mac OS X)
Date: Sun, 03 Apr 2022 20:04:36 +0200
Message-ID: <fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
Lines: 48
Organization: Guest of ProXad - France
NNTP-Posting-Date: 03 Apr 2022 20:04:37 CEST
NNTP-Posting-Host: 91.175.52.121
X-Trace: 1649009077 news-2.free.fr 13432 91.175.52.121:14286
X-Complaints-To: abuse@proxad.net

by: Thomas - Sun, 3 Apr 2022 18:04 UTC

In article <s5k0ai$bb5$1@dont-email.me>, "J-P. Rosen" <rosen@adalog.fr>
wrote:

> On 19/04/2021 at 15:00, Luke A. Guest wrote:
> > They're different types and should be incompatible, because, well, they
> > are. What does Ada have that allows for this that other languages
> > don't? Oh yeah! Types!
>
> They are not so different. For example, you may read the first line of a
> file in a string, then discover that it starts with a BOM, and thus
> decide it is UTF-8.

Could you give me an example of something that you can do now, and that
you could not do if UTF_8_String were private, please?
(To discover that it starts with a BOM, you must look at it.)

>
> BTW, the very first version of this AI had different types, but the ARG
> felt that it would just complicate the interface for the sake of abusive
> "purity".

Could you explain "abusive purity", please?

I guess it is because of ASCII.
I guess a lot of developers use only ASCII in a lot of situations, and
they would find it annoying to need Ada.Strings.UTF_Encoding.Strings every
time.

But I think a simple explicit conversion is acceptable for a type that is
not fully compatible and requires some attention.

The best would be to be required to use ASCII_String as an intermediate,
but I don't know how it could be designed at the language level:

UTF_8_Var   := UTF_8_String (ASCII_String (Latin_1_Var));
Latin_1_Var := String (ASCII_String (UTF_8_Var));

and this would be forbidden:
UTF_8_Var := UTF_8_String (Latin_1_Var);

This would ensure that Constraint_Error is raised when there are some
non-ASCII characters.
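
A run-time approximation of that idea can be sketched in today's Ada without
any language change. The function name ASCII_To_UTF_8 below is hypothetical
(it is not in the standard library), and it only checks at run time; it does
not make the direct conversion illegal at compile time, which is the part
that would need language support.

```ada
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Text_IO;              use Ada.Text_IO;

procedure ASCII_Bridge is

   --  Hypothetical helper: accept a Latin-1 String only if every character
   --  is in the ASCII range, where Latin-1 and UTF-8 use the same octets.
   function ASCII_To_UTF_8 (Item : String) return UTF_8_String is
   begin
      for C of Item loop
         if Character'Pos (C) > 127 then
            raise Constraint_Error with "non-ASCII character in input";
         end if;
      end loop;
      return Item;  --  UTF_8_String is a subtype of String
   end ASCII_To_UTF_8;

   Plain    : constant String := "Arial";
   --  "caf" followed by Latin-1 e-acute (16#E9#)
   Accented : constant String := "caf" & Character'Val (16#E9#);

begin
   Put_Line (ASCII_To_UTF_8 (Plain));

   begin
      Put_Line (ASCII_To_UTF_8 (Accented));
   exception
      when Constraint_Error =>
         Put_Line ("rejected: contains non-ASCII characters");
   end;
end ASCII_Bridge;
```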

--
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

Re: Ada and Unicode

<t2knpr$s26$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8158&group=comp.lang.ada#8158

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: rosen@adalog.fr (J-P. Rosen)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Wed, 6 Apr 2022 21:57:01 +0300
Organization: Adalog
Lines: 29
Message-ID: <t2knpr$s26$1@dont-email.me>
References: <607b5b20$0$27442$426a74cc@news.free.fr>
<86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 6 Apr 2022 18:56:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="0cdfdab607f832ca41c4c727bdce1faf";
logging-data="28742"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18LR6Nsi1pP3vv8FWsfjcLf"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.7.0
Cancel-Lock: sha1:x3IRK7zir+UWDHOI72bsaBrxphE=
In-Reply-To: <fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
Content-Language: fr

by: J-P. Rosen - Wed, 6 Apr 2022 18:57 UTC

On 03/04/2022 at 21:04, Thomas wrote:
>> They are not so different. For example, you may read the first line of a
>> file in a string, then discover that it starts with a BOM, and thus
>> decide it is UTF-8.
>
> Could you give me an example of something that you can do now, and that
> you could not do if UTF_8_String were private, please?
> (To discover that it starts with a BOM, you must look at it.)
Just what I said above, since a BOM is not a valid UTF-8 (otherwise, it
could not be recognized).
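
For reference, the check being discussed can be sketched with the standard
library alone: Ada.Strings.UTF_Encoding declares the constant BOM_8, and
because UTF_8_String is a subtype of String the comparison needs no
conversion. The function name Starts_With_BOM_8 and the sample strings are
made up for illustration; a real program would read the line from a file.

```ada
with Ada.Strings.UTF_Encoding;
with Ada.Text_IO;

procedure BOM_Check is
   use Ada.Strings.UTF_Encoding;

   --  Hypothetical helper: True if Line begins with the three BOM_8 octets.
   function Starts_With_BOM_8 (Line : String) return Boolean is
     (Line'Length >= BOM_8'Length
      and then Line (Line'First .. Line'First + BOM_8'Length - 1) = BOM_8);

   Plain  : constant String := "hello";
   Marked : constant String := BOM_8 & "hello";
begin
   Ada.Text_IO.Put_Line (Boolean'Image (Starts_With_BOM_8 (Plain)));
   Ada.Text_IO.Put_Line (Boolean'Image (Starts_With_BOM_8 (Marked)));
end BOM_Check;
```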

>>
>> BTW, the very first version of this AI had different types, but the ARG
>> felt that it would just complicate the interface for the sake of abusive
>> "purity".
>
> could you explain "abusive purity" please?
>
It was felt that in practice, being too strict in separating the types
would make things more difficult, without any practical gain. This has
been discussed - you may not agree with the outcome, but it was not made
out of pure laziness.

--
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52
https://www.adalog.fr

Re: Ada and Unicode

<t2lesj$d2f$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8159&group=comp.lang.ada#8159

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: randy@rrsoftware.com (Randy Brukardt)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Wed, 6 Apr 2022 20:30:58 -0500
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <t2lesj$d2f$1@dont-email.me>
References: <607b5b20$0$27442$426a74cc@news.free.fr> <86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org> <s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org> <s5k0ai$bb5$1@dont-email.me> <fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr> <t2knpr$s26$1@dont-email.me>
Injection-Date: Thu, 7 Apr 2022 01:31:00 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="65c7ce06d269e04522287d23acb2d265";
logging-data="13391"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/4MVNPs6WAd0AYgGrVuYsfTwdUVtSSj8w="
Cancel-Lock: sha1:4oLs51vIh+kTiUig6vOfhipRC6g=
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.7246
X-RFC2646: Format=Flowed; Response
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-Priority: 3
X-MSMail-Priority: Normal

by: Randy Brukardt - Thu, 7 Apr 2022 01:30 UTC

"J-P. Rosen" <rosen@adalog.fr> wrote in message
news:t2knpr$s26$1@dont-email.me...
....
> It was felt that in practice, being too strict in separating the types
> would make things more difficult, without any practical gain. This has
> been discussed - you may not agree with the outcome, but it was not made
> out of pure laziness.

The problem with that, of course, is that it sends the wrong message
vis-a-vis strong typing and interfaces. If we abandon it at the first sign
of trouble, then we are saying that it isn't really that important.

In this particular case, the reason really came down to practicality: if you
want to do anything string-like with a UTF-8 string, making it a separate
type becomes painful. It wouldn't work with anything in Ada.Strings,
Ada.Text_IO, or Ada.Directories, even though most of the operations are
fine. And there was no political will to replace all of those things with
versions to use with proper universal strings.

Moreover, if you really want to do that, you have to hide much of the array
behavior of the Universal string. For instance, you can't allow willy-nilly
slicing or replacement: cutting a character representation in half or
setting an illegal representation has to be prohibited (operations that
would turn a valid string into an invalid string should always raise an
exception). That means you can't (directly) use built-in indexing and
slicing -- those have to go through some sort of functions. So you do pretty
much have to use a private type for universal strings (similar to
Ada.Strings.Bounded would be best, I think).
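
The slicing hazard can be shown with the existing library. This is only a
sketch: the procedure name Split_Demo and the sample text are made up, and
it relies on Ada.Strings.UTF_Encoding.Strings.Decode raising Encoding_Error
for a truncated sequence, which is what one would expect for invalid input.

```ada
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;                      use Ada.Text_IO;

procedure Split_Demo is
   package Conv renames Ada.Strings.UTF_Encoding.Strings;

   --  "caf" followed by Latin-1 e-acute, then encoded as UTF-8: the
   --  accented letter takes two octets, so the result has five octets.
   Latin_1 : constant String       := "caf" & Character'Val (16#E9#);
   Encoded : constant UTF_8_String := Conv.Encode (Latin_1);

   --  An ordinary array slice that stops after the first octet of the
   --  two-octet sequence. Nothing in the type system objects to this.
   Broken : constant String := Encoded (Encoded'First .. Encoded'Last - 1);
begin
   Put_Line ("octets in encoded string:" & Integer'Image (Encoded'Length));

   begin
      Put_Line (Conv.Decode (Broken));
   exception
      when Encoding_Error =>
         Put_Line ("slice is not valid UTF-8");
   end;
end Split_Demo;
```

With a private type, the slice above would have to go through a function
that could reject it at the point where the string is cut, rather than
later when someone tries to decode it.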

If you had an Ada-like language that used a universal UTF-8 string
internally, you then would have a lot of old and mostly useless operations
supported for array types (since things like slices are mainly useful for
string operations). So such a language should simplify the core
substantially by dropping many of those obsolete features (especially as
little of the library would be directly compatible anyway). So one should
end up with a new language that draws from Ada rather than something in Ada
itself. (It would be great if that language could make strings with
different capacities interoperable - a major annoyance with Ada. And
modernizing access types, generalizing resolution, and the like also would
be good improvements IMHO.)

Randy.

Re: Ada and Unicode

<lysfqolzrg.fsf@pushface.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8161&group=comp.lang.ada#8161

Path: i2pn2.org!i2pn.org!aioe.org!vNObJwB5W4WN632vBkQn9g.user.46.165.242.75.POSTED!not-for-mail
From: simon@pushface.org (Simon Wright)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Fri, 08 Apr 2022 09:56:19 +0100
Organization: Aioe.org NNTP Server
Message-ID: <lysfqolzrg.fsf@pushface.org>
References: <607b5b20$0$27442$426a74cc@news.free.fr>
<86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: gioia.aioe.org; logging-data="7791"; posting-host="vNObJwB5W4WN632vBkQn9g.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (darwin)
X-Notice: Filtered by postfilter v. 0.9.2
Cancel-Lock: sha1:ANFMruGs1ocD/kItZWTXzY7B9SQ=

by: Simon Wright - Fri, 8 Apr 2022 08:56 UTC

"Randy Brukardt" <randy@rrsoftware.com> writes:

> If you had an Ada-like language that used a universal UTF-8 string
> internally, you then would have a lot of old and mostly useless
> operations supported for array types (since things like slices are
> mainly useful for string operations).

Just off the top of my head, wouldn't it be better to use UTF32-encoded
Wide_Wide_Character internally? (you would still have trouble with
e.g. national flag emojis :)
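
A small sketch of the flag case, assuming the usual encoding of flags as a
pair of regional indicator symbols (the procedure name Flag_Demo is made
up). Even with one array element per code point, a single flag as seen by
the user occupies two elements:

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO; use Ada.Text_IO;

procedure Flag_Demo is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Regional indicator symbols F (U+1F1EB) and R (U+1F1F7); displayed
   --  together they form one flag.
   Flag : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#1F1EB#),
      Wide_Wide_Character'Val (16#1F1F7#));

   Encoded : constant String := Conv.Encode (Flag);
begin
   Put_Line ("code points:" & Integer'Image (Flag'Length));
   Put_Line ("UTF-8 octets:" & Integer'Image (Encoded'Length));
end Flag_Demo;
```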

Re: Ada and Unicode

<t2ov3c$10au$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8162&group=comp.lang.ada#8162

Path: i2pn2.org!i2pn.org!aioe.org!hzzNxxMX5IPvnEV4b74Cww.user.46.165.242.91.POSTED!not-for-mail
From: mailbox@dmitry-kazakov.de (Dmitry A. Kazakov)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Fri, 8 Apr 2022 11:26:05 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t2ov3c$10au$1@gioia.aioe.org>
References: <607b5b20$0$27442$426a74cc@news.free.fr>
<86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me>
<lysfqolzrg.fsf@pushface.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="33118"; posting-host="hzzNxxMX5IPvnEV4b74Cww.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-US

by: Dmitry A. Kazakov - Fri, 8 Apr 2022 09:26 UTC

On 2022-04-08 10:56, Simon Wright wrote:
> "Randy Brukardt" <randy@rrsoftware.com> writes:
>
>> If you had an Ada-like language that used a universal UTF-8 string
>> internally, you then would have a lot of old and mostly useless
>> operations supported for array types (since things like slices are
>> mainly useful for string operations).
>
> Just off the top of my head, wouldn't it be better to use UTF32-encoded
> Wide_Wide_Character internally?

Yep, that is exactly the problem, a confusion between interface and
implementation.

Encoding /= interface, e.g. an interface of a string viewed as an array
of characters. That interface is just the same for ASCII, Latin-1, EBCDIC,
RADIX50, UTF-8, etc. strings. Why do you care what is inside?

The Ada type system's inability to implement this interface is another
issue. The usefulness of this interface is yet another. For immutable
strings it is quite useful. For mutable strings it might appear too
constrained, e.g. for packed encodings like UTF-8 and UTF-16.

Also, this interface should have nothing to do with the interface of a
UTF-8 string as an array of octets or the interface of a UTF-16LE
string as an array of little-endian words.

Since Ada cannot separate these interfaces, for practical purposes,
Strings are arrays of octets considered as UTF-8 encoding. The rest goes
into coding guidelines under the title "never ever do this."

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: Ada and Unicode

<lyfsmn2xjn.fsf@pushface.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8165&group=comp.lang.ada#8165

Path: i2pn2.org!i2pn.org!aioe.org!vNObJwB5W4WN632vBkQn9g.user.46.165.242.75.POSTED!not-for-mail
From: simon@pushface.org (Simon Wright)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Fri, 08 Apr 2022 20:19:08 +0100
Organization: Aioe.org NNTP Server
Message-ID: <lyfsmn2xjn.fsf@pushface.org>
References: <607b5b20$0$27442$426a74cc@news.free.fr>
<86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me>
<lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: gioia.aioe.org; logging-data="55470"; posting-host="vNObJwB5W4WN632vBkQn9g.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (darwin)
X-Notice: Filtered by postfilter v. 0.9.2
Cancel-Lock: sha1:R18A7vRIOGoE1z5Qk0FnLIExUhc=

by: Simon Wright - Fri, 8 Apr 2022 19:19 UTC

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On 2022-04-08 10:56, Simon Wright wrote:
>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>
>>> If you had an Ada-like language that used a universal UTF-8 string
>>> internally, you then would have a lot of old and mostly useless
>>> operations supported for array types (since things like slices are
>>> mainly useful for string operations).
>>
>> Just off the top of my head, wouldn't it be better to use
>> UTF32-encoded Wide_Wide_Character internally?
>
> Yep, that is exactly the problem, a confusion between interface
> and implementation.

Don't understand. My point was that *when you are implementing this* it
might be easier to deal with 32-bit characters/code points/whatever the
proper jargon is than with UTF-8.

> Encoding /= interface, e.g. an interface of a string viewed as an
> array of characters. That interface is just the same for ASCII, Latin-1,
> EBCDIC, RADIX50, UTF-8, etc. strings. Why do you care what is inside?

With a user's hat on, I don't. Implementers might have a different point
of view.

Re: Ada and Unicode

<t2q3cb$bbt$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8167&group=comp.lang.ada#8167

Path: i2pn2.org!i2pn.org!aioe.org!hzzNxxMX5IPvnEV4b74Cww.user.46.165.242.91.POSTED!not-for-mail
From: mailbox@dmitry-kazakov.de (Dmitry A. Kazakov)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Fri, 8 Apr 2022 21:45:18 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t2q3cb$bbt$1@gioia.aioe.org>
References: <607b5b20$0$27442$426a74cc@news.free.fr>
<86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me>
<lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org>
<lyfsmn2xjn.fsf@pushface.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="11645"; posting-host="hzzNxxMX5IPvnEV4b74Cww.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-US

by: Dmitry A. Kazakov - Fri, 8 Apr 2022 19:45 UTC

On 2022-04-08 21:19, Simon Wright wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>
>> On 2022-04-08 10:56, Simon Wright wrote:
>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>
>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>> internally, you then would have a lot of old and mostly useless
>>>> operations supported for array types (since things like slices are
>>>> mainly useful for string operations).
>>>
>>> Just off the top of my head, wouldn't it be better to use
>>> UTF32-encoded Wide_Wide_Character internally?
>>
>> Yep, that is exactly the problem, a confusion between interface
>> and implementation.
>
> Don't understand. My point was that *when you are implementing this* it
> might be easier to deal with 32-bit characters/code points/whatever the
> proper jargon is than with UTF-8.

I think it would be more difficult, because you will have to convert
from and to UTF-8 under the hood or explicitly. UTF-8 is the de facto
interface standard and I/O standard. That covers 60-70% of all cases
where you need a string. Most string operations like search, comparison,
and slicing are isomorphic between code points and octets. So you would
gain nothing from keeping strings internally as arrays of code points.
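
The search case can be sketched with Ada.Strings.Fixed.Index, which knows
nothing about UTF-8 and simply compares octets. The procedure name
Search_Demo and the sample text are made up; the octets are written out
explicitly so the example does not depend on the source file encoding.

```ada
with Ada.Strings.Fixed;
with Ada.Text_IO; use Ada.Text_IO;

procedure Search_Demo is
   --  UTF-8 encodings of i-diaeresis (U+00EF) and e-acute (U+00E9)
   I_Diaeresis : constant String :=
     Character'Val (16#C3#) & Character'Val (16#AF#);
   E_Acute     : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   Haystack : constant String := "na" & I_Diaeresis & "ve caf" & E_Acute;
   Needle   : constant String := "caf" & E_Acute;

   --  Plain octet-wise search on the UTF-8 data
   Pos : constant Natural := Ada.Strings.Fixed.Index (Haystack, Needle);
begin
   --  The result is an octet index (8 here), one more than the
   --  character index (7), because of the two-octet letter before it.
   Put_Line ("octet index of match:" & Natural'Image (Pos));
   Put_Line
     ("slice equals needle: "
      & Boolean'Image (Haystack (Pos .. Pos + Needle'Length - 1) = Needle));
end Search_Demo;
```

The match starts on a character boundary because no complete UTF-8
sequence can begin in the middle of another one; what the caller must not
do is treat the returned index as a character count.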

The situation is comparable to Unbounded_Strings. The implementation is
relatively simple, but the user must carry the burden of calling
To_String and To_Unbounded_String all over the application and the
processor must suffer the overhead of copying arrays here and there.

>> Encoding /= interface, e.g. an interface of a string viewed as an
>> array of characters. That interface is just the same for ASCII, Latin-1,
>> EBCDIC, RADIX50, UTF-8, etc. strings. Why do you care what is inside?
>
> With a user's hat on, I don't. Implementers might have a different point
> of view.

Sure, but in the Ada philosophy their opinion should carry less weight
than, say, in C.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: Ada and Unicode

<t2r0mk$q4d$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8169&group=comp.lang.ada#8169

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: randy@rrsoftware.com (Randy Brukardt)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Fri, 8 Apr 2022 23:05:38 -0500
Organization: A noiseless patient Spider
Lines: 87
Message-ID: <t2r0mk$q4d$1@dont-email.me>
References: <607b5b20$0$27442$426a74cc@news.free.fr> <86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org> <s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org> <s5k0ai$bb5$1@dont-email.me> <fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr> <t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me> <lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org> <lyfsmn2xjn.fsf@pushface.org> <t2q3cb$bbt$1@gioia.aioe.org>
Injection-Date: Sat, 9 Apr 2022 04:05:40 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e67210af8ec07b0490c4686d225dc9ca";
logging-data="26765"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/6osmb2tyMtfbfkruD/KEqzXVCGBeRrJc="
Cancel-Lock: sha1:xh7N6YYB0qMsXYLb8DM/uJ+v21k=
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.7246
X-RFC2646: Format=Flowed; Response
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-Priority: 3
X-MSMail-Priority: Normal

by: Randy Brukardt - Sat, 9 Apr 2022 04:05 UTC

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:t2q3cb$bbt$1@gioia.aioe.org...
> On 2022-04-08 21:19, Simon Wright wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>
>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>>
>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>> internally, you then would have a lot of old and mostly useless
>>>>> operations supported for array types (since things like slices are
>>>>> mainly useful for string operations).
>>>>
>>>> Just off the top of my head, wouldn't it be better to use
>>>> UTF32-encoded Wide_Wide_Character internally?
>>>
>>> Yep, that is exactly the problem, a confusion between interface
>>> and implementation.
>>
>> Don't understand. My point was that *when you are implementing this* it
>> might be easier to deal with 32-bit characters/code points/whatever the
>> proper jargon is than with UTF-8.
>
> I think it would be more difficult, because you will have to convert from
> and to UTF-8 under the hood or explicitly. UTF-8 is the de facto interface
> standard and I/O standard. That covers 60-70% of all cases where you need a
> string. Most string operations like search, comparison, and slicing are
> isomorphic between code points and octets. So you would gain nothing from
> keeping strings internally as arrays of code points.

I basically agree with Dmitry here. The internal representation is an
implementation detail, but it seems likely that you would want to store
UTF-8 strings directly; they're almost always going to be half the size
(even for languages using their own characters like Greek) and for most of
us, they'll be just a bit more than a quarter the size. The amount of bytes
you copy around matters; the number of operations where code points are
needed is fairly small.
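
The size comparison can be checked with a short sketch (the names Size_Demo,
Report, and WWC are made up, and the UTF-32 figure assumes four octets per
code point):

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO; use Ada.Text_IO;

procedure Size_Demo is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   function WWC (Code : Natural) return Wide_Wide_Character is
     (Wide_Wide_Character'Val (Code));

   --  Print the storage needed for Text in UTF-8 and in UTF-32.
   procedure Report (Label : String; Text : Wide_Wide_String) is
      Encoded : constant String := Conv.Encode (Text);
   begin
      Put_Line
        (Label & ":" & Integer'Image (Encoded'Length) & " octets as UTF-8,"
         & Integer'Image (4 * Text'Length) & " octets as UTF-32");
   end Report;

   --  Five Greek letters, alpha through epsilon (U+03B1 .. U+03B5)
   Greek : constant Wide_Wide_String :=
     (WWC (16#3B1#), WWC (16#3B2#), WWC (16#3B3#),
      WWC (16#3B4#), WWC (16#3B5#));
begin
   Report ("ASCII", "Hello");
   Report ("Greek", Greek);
end Size_Demo;
```

Five ASCII characters take 5 octets against 20, and five Greek letters take
10 against 20, which matches the "quarter" and "half" figures above.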

The main problem with UTF-8 is representing the code point positions in a
way that they (a) aren't abused and (b) don't cost too much to calculate.
Just using character indexes is too expensive for UTF-8 and UTF-16
representations, and using octet indexes is unsafe (since splitting a
character representation is a possibility). I'd probably use an abstract
character position type that was implemented with an octet index under the
covers.
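
One possible shape for such a position type, as a minimal sketch only: the
package UTF_8_Cursors and its operations are hypothetical, the string is
assumed to hold valid UTF-8, and a real design would also tie the cursor to
the string it came from.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Cursor_Demo is

   --  Hypothetical package: an opaque position in a UTF-8 string.
   package UTF_8_Cursors is
      type Cursor is private;
      function First (S : String) return Cursor;
      function Has_Element (S : String; C : Cursor) return Boolean;
      function Next (S : String; C : Cursor) return Cursor;
   private
      type Cursor is record
         Octet : Positive := 1;  --  octet index, hidden from clients
      end record;
   end UTF_8_Cursors;

   package body UTF_8_Cursors is

      --  Continuation octets have the bit pattern 10xxxxxx.
      function Is_Continuation (Ch : Character) return Boolean is
        (Character'Pos (Ch) in 16#80# .. 16#BF#);

      function First (S : String) return Cursor is
      begin
         return (Octet => S'First);
      end First;

      function Has_Element (S : String; C : Cursor) return Boolean is
      begin
         return C.Octet <= S'Last;
      end Has_Element;

      --  Step over one character by skipping its continuation octets.
      function Next (S : String; C : Cursor) return Cursor is
         I : Positive := C.Octet + 1;
      begin
         while I <= S'Last and then Is_Continuation (S (I)) loop
            I := I + 1;
         end loop;
         return (Octet => I);
      end Next;

   end UTF_8_Cursors;

   use UTF_8_Cursors;

   --  "caf", e-acute as two UTF-8 octets, then "!"
   Text  : constant String :=
     "caf" & Character'Val (16#C3#) & Character'Val (16#A9#) & "!";
   Pos   : Cursor  := First (Text);
   Chars : Natural := 0;
begin
   while Has_Element (Text, Pos) loop
      Chars := Chars + 1;
      Pos   := Next (Text, Pos);
   end loop;
   Put_Line ("octets:" & Integer'Image (Text'Length));
   Put_Line ("characters:" & Natural'Image (Chars));
end Cursor_Demo;
```

Because Cursor is private, client code cannot add 5 to it or use it as a
slice bound, so the only way to move is one whole character at a time.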

I think that would work OK as doing math on those is suspicious with a UTF
representation. We're spoiled from using Latin-1 representations, of course,
but generally one is interested in 5 characters, not 5 octets. And the
number of octets in 5 characters depends on the string. So most of the sorts
of operations that I tend to do (for instance from some code I was fixing
earlier today):

if Font'Length > 6 and then
   Font(2..6) = "Arial" then

This would be a bad idea if one is using any sort of universal
representation -- you don't know how many octets are in the string literal so
you can't assume a number in the test string. So the slice is dangerous
(even though in this particular case it would be OK since the test string is
all ASCII characters -- but I wouldn't want users to get in the habit of
assuming such things).

[BTW, the above was a bad idea anyway, because it turns out that the
function in the Ada library returned bounds that don't start at 1. So the
slice was usually out of range -- which is why I was looking at the code.
Another thing that we could do without. Slices are evil, since they *seem*
to be the right solution, yet rarely are in practice without a lot of
hoops.]
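
For the bounds problem specifically, a sketch of the same test written
relative to 'First (the names Slice_Demo and Names_Arial are made up, and
the test is still octet-based, so it is only safe for an ASCII literal):

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Slice_Demo is

   --  Same test as above, but expressed relative to Font'First so that it
   --  does not assume the string starts at index 1.
   function Names_Arial (Font : String) return Boolean is
     (Font'Length > 6
      and then Font (Font'First + 1 .. Font'First + 5) = "Arial");

   Source : constant String := "xx Arial Bold";
begin
   --  The slice passed here has bounds 3 .. 13, not 1 .. 11.
   Put_Line (Boolean'Image (Names_Arial (Source (3 .. 13))));
end Slice_Demo;
```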

> The situation is comparable to Unbounded_Strings. The implementation is
> relatively simple, but the user must carry the burden of calling To_String
> and To_Unbounded_String all over the application and the processor must
> suffer the overhead of copying arrays here and there.

Yes, but that happens because Ada doesn't really have a string abstraction,
so when you try to build one, you can't fully do the job. One presumes that
a new language with a universal UTF-8 string wouldn't have that problem. (As
previously noted, I don't see much point in trying to patch up Ada with a
bunch of UTF-8 string packages; you would need an entire new set of
Ada.Strings libraries and I/O libraries, and then you'd have all of the old
stuff messing up resolution, using the best names, and confusing everything.
A cleaner slate is needed.)

Randy.

Re: Ada and Unicode

<ly5yni3dnd.fsf@pushface.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8170&group=comp.lang.ada#8170

Path: i2pn2.org!i2pn.org!aioe.org!vNObJwB5W4WN632vBkQn9g.user.46.165.242.75.POSTED!not-for-mail
From: simon@pushface.org (Simon Wright)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Sat, 09 Apr 2022 08:43:34 +0100
Organization: Aioe.org NNTP Server
Message-ID: <ly5yni3dnd.fsf@pushface.org>
References: <607b5b20$0$27442$426a74cc@news.free.fr>
<86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me>
<lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org>
<lyfsmn2xjn.fsf@pushface.org> <t2q3cb$bbt$1@gioia.aioe.org>
<t2r0mk$q4d$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: gioia.aioe.org; logging-data="21787"; posting-host="vNObJwB5W4WN632vBkQn9g.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (darwin)
X-Notice: Filtered by postfilter v. 0.9.2
Cancel-Lock: sha1:mdS6T8kOK8g0IPH+Q+8E/wbhwu4=

by: Simon Wright - Sat, 9 Apr 2022 07:43 UTC

"Randy Brukardt" <randy@rrsoftware.com> writes:

> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
> news:t2q3cb$bbt$1@gioia.aioe.org...
>> On 2022-04-08 21:19, Simon Wright wrote:
>>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>>
>>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>>>
>>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>>> internally, you then would have a lot of old and mostly useless
>>>>>> operations supported for array types (since things like slices are
>>>>>> mainly useful for string operations).
>>>>>
>>>>> Just off the top of my head, wouldn't it be better to use
>>>>> UTF32-encoded Wide_Wide_Character internally?
>>>>
>>>> Yep, that is exactly the problem, a confusion between interface
>>>> and implementation.
>>>
>>> Don't understand. My point was that *when you are implementing this* it
>>> might be easier to deal with 32-bit characters/code points/whatever the
>>> proper jargon is than with UTF-8.
>>
>> I think it would be more difficult, because you will have to convert from
>> and to UTF-8 under the hood or explicitly. UTF-8 is the de facto interface
>> standard and I/O standard. That covers 60-70% of all cases where you need a
>> string. Most string operations like search, comparison, and slicing are
>> isomorphic between code points and octets. So you would gain nothing from
>> keeping strings internally as arrays of code points.
>
> I basically agree with Dmitry here. The internal representation is an
> implementation detail, but it seems likely that you would want to store
> UTF-8 strings directly; they're almost always going to be half the size
> (even for languages using their own characters like Greek) and for most of
> us, they'll be just a bit more than a quarter the size. The amount of bytes
> you copy around matters; the number of operations where code points are
> needed is fairly small.

Well, I don't have any skin in this game, so I'll shut up at this point.

Re: Ada and Unicode

<62515f7a$0$25324$426a74cc@news.free.fr>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8171&group=comp.lang.ada#8171

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!cleanfeed2-b.proxad.net!nnrp3-2.free.fr!not-for-mail
Date: Sat, 9 Apr 2022 12:27:04 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.7.0
Content-Language: fr
Newsgroups: comp.lang.ada
References: <607b5b20$0$27442$426a74cc@news.free.fr>
<86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me>
<lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org>
<lyfsmn2xjn.fsf@pushface.org> <t2q3cb$bbt$1@gioia.aioe.org>
<t2r0mk$q4d$1@dont-email.me>
From: 314@drpi.fr (DrPi)
Subject: Re: Ada and Unicode
In-Reply-To: <t2r0mk$q4d$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 118
Message-ID: <62515f7a$0$25324$426a74cc@news.free.fr>
Organization: Guest of ProXad - France
NNTP-Posting-Date: 09 Apr 2022 12:27:06 CEST
NNTP-Posting-Host: 82.65.30.55
X-Trace: 1649500026 news-1.free.fr 25324 82.65.30.55:51170
X-Complaints-To: abuse@proxad.net

by: DrPi - Sat, 9 Apr 2022 10:27 UTC

On 09/04/2022 at 06:05, Randy Brukardt wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
> news:t2q3cb$bbt$1@gioia.aioe.org...
>> On 2022-04-08 21:19, Simon Wright wrote:
>>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>>
>>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>>>
>>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>>> internally, you then would have a lot of old and mostly useless
>>>>>> operations supported for array types (since things like slices are
>>>>>> mainly useful for string operations).
>>>>>
>>>>> Just off the top of my head, wouldn't it be better to use
>>>>> UTF32-encoded Wide_Wide_Character internally?
>>>>
>>>> Yep, that is exactly the problem, a confusion between interface
>>>> and implementation.
>>>
>>> Don't understand. My point was that *when you are implementing this* it
>>> might be easier to deal with 32-bit characters/code points/whatever the
>>> proper jargon is than with UTF-8.
>>
>> I think it would be more difficult, because you will have to convert from
>> and to UTF-8 under the hood or explicitly. UTF-8 is the de facto interface
>> standard and I/O standard. That covers 60-70% of all cases where you need a
>> string. Most string operations like search, comparison, and slicing are
>> isomorphic between code points and octets. So you would gain nothing from
>> keeping strings internally as arrays of code points.
>
> I basically agree with Dmitry here. The internal representation is an
> implementation detail, but it seems likely that you would want to store
> UTF-8 strings directly; they're almost always going to be half the size
> (even for languages using their own characters like Greek) and for most of
> us, they'll be just a bit more than a quarter the size. The amount of bytes
> you copy around matters; the number of operations where code points are
> needed is fairly small.
>
> The main problem with UTF-8 is representing the code point positions in a
> way that they (a) aren't abused and (b) don't cost too much to calculate.
> Just using character indexes is too expensive for UTF-8 and UTF-16
> representations, and using octet indexes is unsafe (since splitting a
> character representation is a possibility). I'd probably use an abstract
> character position type that was implemented with an octet index under the
> covers.
>
> I think that would work OK as doing math on those is suspicious with a UTF
> representation. We're spoiled from using Latin-1 representations, of course,
> but generally one is interested in 5 characters, not 5 octets. And the
> number of octets in 5 characters depends on the string. So most of the sorts
> of operations that I tend to do (for instance from some code I was fixing
> earlier today):
>
> if Font'Length > 6 and then
> Font(2..6) = "Arial" then
>
> This would be a bad idea if one is using any sort of universal
> representation -- you don't know how many octets are in the string literal so
> you can't assume a number in the test string. So the slice is dangerous
> (even though in this particular case it would be OK since the test string is
> all ASCII characters -- but I wouldn't want users to get in the habit of
> assuming such things).
>
> [BTW, the above was a bad idea anyway, because it turns out that the
> function in the Ada library returned bounds that don't start at 1. So the
> slice was usually out of range -- which is why I was looking at the code.
> Another thing that we could do without. Slices are evil, since they *seem*
> to be the right solution, yet rarely are in practice without a lot of
> hoops.]
>
>> The situation is comparable to Unbounded_Strings. The implementation is
>> relatively simple, but the user must carry the burden of calling To_String
>> and To_Unbounded_String all over the application and the processor must
>> suffer the overhead of copying arrays here and there.
>
> Yes, but that happens because Ada doesn't really have a string abstraction,
> so when you try to build one, you can't fully do the job. One presumes that
> a new language with a universal UTF-8 string wouldn't have that problem. (As
> previously noted, I don't see much point in trying to patch up Ada with a
> bunch of UTF-8 string packages; you would need an entire new set of
> Ada.Strings libraries and I/O libraries, and then you'd have all of the old
> stuff messing up resolution, using the best names, and confusing everything.
> A cleaner slate is needed.)
>
> Randy.
>
>
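The quoted point that "the number of octets in 5 characters depends on the string", and that an octet index can land inside a character, can be illustrated with a small sketch. It is written in Python only because the thread compares against Python; the helper names octets_for_prefix and is_safe_octet_cut are made up for this illustration and come from no library.

```python
def octets_for_prefix(text: str, n_chars: int) -> int:
    """Number of UTF-8 octets needed for the first n_chars code points."""
    return len(text[:n_chars].encode("utf-8"))


def is_safe_octet_cut(encoded: bytes, offset: int) -> bool:
    """True if cutting valid UTF-8 octets at offset does not split a character."""
    try:
        encoded[:offset].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin = "Arial"
greek = "\u03a9\u03bc\u03ad\u03b3\u03b1"  # five Greek letters, two octets each

# Same number of characters, different number of octets.
print(octets_for_prefix(latin, 5))
print(octets_for_prefix(greek, 5))

# An octet offset of 4 falls between two letters; 5 falls inside the third.
print(is_safe_octet_cut(greek.encode("utf-8"), 4))
print(is_safe_octet_cut(greek.encode("utf-8"), 5))
```

An abstract position type, as suggested in the quoted post, would only ever hand out offsets for which the second helper returns True.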

In Python-2, there is the same kind of problem. A string is a byte
array. It is the programmer's responsibility to encode/decode to/from
UTF8/Latin1/... and to manage everything correctly. Literal strings can
be considered as encoded or decoded depending on the notation ("" or u"").

In Python-3, a string is a character(glyph ?) array. The internal
representation is hidden to the programmer.
UTF8/Latin1/... encoded "strings" are of type bytes (byte array).
Writing/reading to/from a file in binary mode is done with the bytes type.
When writing/reading to/from a file in text mode, you specify the
encoding to use (a platform-dependent default applies otherwise). The
encoding/decoding is then managed internally.
As a general rule, all "external communications" are done with bytes
(byte array). It is the programmer's responsibility to encode/decode
where needed to convert from/to strings.
The source files (.py) are considered to be UTF8 encoded by default but
one can declare the actual encoding at the top of the file in a special
comment tag. When a badly encoded character is found, an exception is
raised at parsing time. So, literal strings are real strings, not bytes.

I think the Python-3 way of doing things is much more understandable and
really usable.

On the Ada side, I've still not understood how to correctly deal with
all this stuff.

Note: In Python-3, the bytes type is not reserved for encoded "strings". It
is a versatile type for exactly what its name says: a byte array.
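For readers who don't use Python, here is a minimal sketch of the str/bytes separation described above. The sample text and the file name demo.txt are arbitrary choices for the illustration; only standard-library calls are used.

```python
import os
import tempfile

text = "caf\u00e9"                 # str: four code points
utf8 = text.encode("utf-8")        # bytes: the e-acute takes two octets
latin1 = text.encode("latin-1")    # bytes: one octet per character here

print(len(text), len(utf8), len(latin1))

# str and bytes are distinct types and do not mix implicitly.
try:
    text + utf8
except TypeError:
    mixing_rejected = True
else:
    mixing_rejected = False
print(mixing_rejected)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    # Text mode: the file object encodes str to bytes using the given encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    # Binary mode: we get the raw octets back, with no decoding.
    with open(path, "rb") as f:
        raw = f.read()

roundtrip = raw.decode("utf-8")    # explicit decode gives the str back
print(raw == utf8, roundtrip == text)
```

The point of the design is that every conversion between the two types names an encoding somewhere, either in encode/decode or in the open call.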

Re: Ada and Unicode

<pjd35hp8i1l7qflnltn7r9cp61u1uh8dv2@4ax.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8174&group=comp.lang.ada#8174

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!buffer2.nntp.dca1.giganews.com!buffer1.nntp.dca1.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Sat, 09 Apr 2022 11:46:04 -0500
From: wlfraed@ix.netcom.com (Dennis Lee Bieber)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Sat, 09 Apr 2022 12:46:04 -0400
Organization: IISS Elusive Unicorn
Message-ID: <pjd35hp8i1l7qflnltn7r9cp61u1uh8dv2@4ax.com>
References: <86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org> <s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org> <s5k0ai$bb5$1@dont-email.me> <fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr> <t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me> <lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org> <lyfsmn2xjn.fsf@pushface.org> <t2q3cb$bbt$1@gioia.aioe.org> <t2r0mk$q4d$1@dont-email.me> <62515f7a$0$25324$426a74cc@news.free.fr>
User-Agent: ForteAgent/8.00.32.1272
X-No-Archive: yes
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Lines: 31
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-RNe4bGfTwIijB1l4maREDzfVEkoEaCtcJCxp/mlK4x+uNiOnfLQVwL9dYdqsn38iwwhUvwOyHoE3bZg!r/3FUQYqMgj25EvnF0b9Fj810sGTITRUrr38MWzhV9f6pBXCD3l+RpSyZd4K9MEUKPKs7qei
X-Complaints-To: abuse@giganews.com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 2645

by: Dennis Lee Bieber - Sat, 9 Apr 2022 16:46 UTC

On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the
following:

>
>In Python-3, a string is a character(glyph ?) array. The internal
>representation is hidden to the programmer.

<SNIP>
>
>On the Ada side, I've still not understood how to correctly deal with
>all this stuff.

One thing to take into account is that Python strings are immutable.
Changing the contents of a string requires constructing a new string from
parts that incorporate the change.

That allows for the second aspect -- even if not visible to a
programmer, Python (3) strings are not a fixed representation: If all
characters in the string fit in 8 bits (code points up to U+00FF), that
string is stored using one byte per character. If any character needs 16
bits, the entire string is stored as 16-bit units (and similarly for
characters that need 32 bits). Thus, indexing into the string is still
fast -- just needing to scale the index by the character width of the
entire string.
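
The per-string width can be observed indirectly. The sketch below assumes CPython 3.3 or later (where this flexible representation was introduced) and relies on sys.getsizeof, so it is an implementation-specific probe, not something the language guarantees; storage_width is a name made up for the illustration.

```python
import sys


def storage_width(s: str) -> int:
    """Estimate bytes stored per character by comparing s with s + s.

    The fixed object header cancels out in the subtraction (CPython only).
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)


ascii_only = "a" * 100
with_euro = "a" * 99 + "\u20ac"          # one character above U+00FF
with_emoji = "a" * 99 + "\U0001f600"     # one character above U+FFFF

# A single wide character widens the storage of the whole string.
print(storage_width(ascii_only))
print(storage_width(with_euro))
print(storage_width(with_emoji))

# Indexing stays by character position whatever the internal width is.
print(with_emoji[99] == "\U0001f600")
```

On CPython these widths should come out as 1, 2 and 4 respectively; other Python implementations are free to store strings differently.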

--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/

Re: Ada and Unicode

<6251d7b1$0$3427$426a74cc@news.free.fr>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8175&group=comp.lang.ada#8175

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!cleanfeed1-a.proxad.net!nnrp4-1.free.fr!not-for-mail
Date: Sat, 9 Apr 2022 20:59:59 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.7.0
Subject: Re: Ada and Unicode
Content-Language: en-US
Newsgroups: comp.lang.ada
References: <86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me>
<lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org>
<lyfsmn2xjn.fsf@pushface.org> <t2q3cb$bbt$1@gioia.aioe.org>
<t2r0mk$q4d$1@dont-email.me> <62515f7a$0$25324$426a74cc@news.free.fr>
<pjd35hp8i1l7qflnltn7r9cp61u1uh8dv2@4ax.com>
From: 314@drpi.fr (DrPi)
In-Reply-To: <pjd35hp8i1l7qflnltn7r9cp61u1uh8dv2@4ax.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 31
Message-ID: <6251d7b1$0$3427$426a74cc@news.free.fr>
Organization: Guest of ProXad - France
NNTP-Posting-Date: 09 Apr 2022 21:00:01 CEST
NNTP-Posting-Host: 82.65.30.55
X-Trace: 1649530801 news-2.free.fr 3427 82.65.30.55:51589
X-Complaints-To: abuse@proxad.net

by: DrPi - Sat, 9 Apr 2022 18:59 UTC

On 09/04/2022 at 18:46, Dennis Lee Bieber wrote:
> On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the
> following:
>
>>
>> In Python-3, a string is a character(glyph ?) array. The internal
>> representation is hidden to the programmer.
>
> <SNIP>
>>
>> On the Ada side, I've still not understood how to correctly deal with
>> all this stuff.
>
> One thing to take into account is that Python strings are immutable.
> Changing the contents of a string requires constructing a new string from
> parts that incorporate the change.
>

Right. I forgot to mention it.

> That allows for the second aspect -- even if not visible to a
> programmer, Python (3) strings are not a fixed representation: If all
> characters in the string fit in 8 bits (code points up to U+00FF), that
> string is stored using one byte per character. If any character needs 16
> bits, the entire string is stored as 16-bit units (and similarly for
> characters that need 32 bits). Thus, indexing into the string is still
> fast -- just needing to scale the index by the character width of the
> entire string.
>

Thanks for clarifying.

Re: Ada and Unicode

<3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8176&group=comp.lang.ada#8176

X-Received: by 2002:ad4:5965:0:b0:440:fee0:bef2 with SMTP id eq5-20020ad45965000000b00440fee0bef2mr22833807qvb.68.1649570329949;
Sat, 09 Apr 2022 22:58:49 -0700 (PDT)
X-Received: by 2002:a81:e24c:0:b0:2eb:4513:3793 with SMTP id
z12-20020a81e24c000000b002eb45133793mr20842567ywl.204.1649570329733; Sat, 09
Apr 2022 22:58:49 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Sat, 9 Apr 2022 22:58:49 -0700 (PDT)
In-Reply-To: <62515f7a$0$25324$426a74cc@news.free.fr>
Injection-Info: google-groups.googlegroups.com; posting-host=2a02:c207:2020:5933:0:0:0:1;
posting-account=niG3UgoAAAD7iQ3takWjEn_gw6D9X3ww
NNTP-Posting-Host: 2a02:c207:2020:5933:0:0:0:1
References: <607b5b20$0$27442$426a74cc@news.free.fr> <86mttuk5f0.fsf@stephe-leake.org>
<s5jr59$1tkq$1@gioia.aioe.org> <s5juep$1lbu$1@gioia.aioe.org>
<s5jute$1s08$1@gioia.aioe.org> <s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me> <lysfqolzrg.fsf@pushface.org>
<t2ov3c$10au$1@gioia.aioe.org> <lyfsmn2xjn.fsf@pushface.org>
<t2q3cb$bbt$1@gioia.aioe.org> <t2r0mk$q4d$1@dont-email.me> <62515f7a$0$25324$426a74cc@news.free.fr>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com>
Subject: Re: Ada and Unicode
From: vgodunko@gmail.com (Vadim Godunko)
Injection-Date: Sun, 10 Apr 2022 05:58:49 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: Vadim Godunko - Sun, 10 Apr 2022 05:58 UTC

On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote:
>
> On the Ada side, I've still not understood how to correctly deal with
> all this stuff.
>
Take a look at https://github.com/AdaCore/VSS

The ideas behind this library are close to the ideas of type separation in Python3. A string is a Virtual_String; a byte sequence is a Stream_Element_Vector. To convert a byte stream to a string or back, use Virtual_String_Encoder/Virtual_String_Decoder.

I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; cleaner types and APIs are a requirement now. The only case where the old character/string types really add value is low-resource embedded systems; in other cases their use generates a lot of hidden issues, which are very hard to detect.

Re: Ada and Unicode

<6253290a$0$25333$426a74cc@news.free.fr>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8180&group=comp.lang.ada#8180

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!cleanfeed2-b.proxad.net!nnrp3-2.free.fr!not-for-mail
Date: Sun, 10 Apr 2022 20:59:20 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.7.0
Subject: Re: Ada and Unicode
Content-Language: en-US
Newsgroups: comp.lang.ada
References: <607b5b20$0$27442$426a74cc@news.free.fr>
<86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org>
<s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org>
<s5k0ai$bb5$1@dont-email.me>
<fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr>
<t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me>
<lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org>
<lyfsmn2xjn.fsf@pushface.org> <t2q3cb$bbt$1@gioia.aioe.org>
<t2r0mk$q4d$1@dont-email.me> <62515f7a$0$25324$426a74cc@news.free.fr>
<3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com>
From: 314@drpi.fr (DrPi)
In-Reply-To: <3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 13
Message-ID: <6253290a$0$25333$426a74cc@news.free.fr>
Organization: Guest of ProXad - France
NNTP-Posting-Date: 10 Apr 2022 20:59:22 CEST
NNTP-Posting-Host: 82.65.30.55
X-Trace: 1649617162 news-3.free.fr 25333 82.65.30.55:49263
X-Complaints-To: abuse@proxad.net

by: DrPi - Sun, 10 Apr 2022 18:59 UTC

On 10/04/2022 at 07:58, Vadim Godunko wrote:
> On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote:
>>
>> On the Ada side, I've still not understood how to correctly deal with
>> all this stuff.
>>
> Take a look at https://github.com/AdaCore/VSS
>
> The ideas behind this library are close to the ideas of type separation in Python3. A string is a Virtual_String; a byte sequence is a Stream_Element_Vector. To convert a byte stream to a string or back, use Virtual_String_Encoder/Virtual_String_Decoder.
>
> I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; cleaner types and APIs are a requirement now. The only case where the old character/string types really add value is low-resource embedded systems; in other cases their use generates a lot of hidden issues, which are very hard to detect.

That's an interesting solution.

Re: Ada and Unicode

<t3359l$8u0$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=8182&group=comp.lang.ada#8182

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: randy@rrsoftware.com (Randy Brukardt)
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Tue, 12 Apr 2022 01:13:08 -0500
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <t3359l$8u0$1@dont-email.me>
References: <607b5b20$0$27442$426a74cc@news.free.fr> <86mttuk5f0.fsf@stephe-leake.org> <s5jr59$1tkq$1@gioia.aioe.org> <s5juep$1lbu$1@gioia.aioe.org> <s5jute$1s08$1@gioia.aioe.org> <s5k0ai$bb5$1@dont-email.me> <fantome.forums.tDeContes-E8EAB8.20043603042022@news.free.fr> <t2knpr$s26$1@dont-email.me> <t2lesj$d2f$1@dont-email.me> <lysfqolzrg.fsf@pushface.org> <t2ov3c$10au$1@gioia.aioe.org> <lyfsmn2xjn.fsf@pushface.org> <t2q3cb$bbt$1@gioia.aioe.org> <t2r0mk$q4d$1@dont-email.me> <62515f7a$0$25324$426a74cc@news.free.fr> <3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com>
Injection-Date: Tue, 12 Apr 2022 06:13:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="8fde082ee3f0ae2a21e197452c3adfd5";
logging-data="9152"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+AB+uKnlppNkpxgMIBwsya2esjNij5o6w="
Cancel-Lock: sha1:ci59KMKBmkw2l3RFdurLE8eIBSg=
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.7246
X-RFC2646: Format=Flowed; Original
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-Priority: 3
X-MSMail-Priority: Normal

by: Randy Brukardt - Tue, 12 Apr 2022 06:13 UTC

"Vadim Godunko" <vgodunko@gmail.com> wrote in message
news:3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com...
...
>I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and
>programming languages; cleaner types and APIs are a requirement now.

...which essentially means Ada is obsolete in your view, as String in
particular is way too embedded in the definition and the language-defined
units to use anything else. You'd end up with a mass of conversions to get
anything done (the main problem with Ada.Strings.Unbounded).

Or I suppose you could replace pretty much the entire library with a new
one. But now you have two of everything to confuse newcomers and you still
have a mass of old nonsense weighing down the language and complicating
implementations.

>The only case where the old character/string types really add value is
>low-resource embedded systems; ...

...which of course is at least 50% of the use of Ada, and probably closer to
90% of the money. Any solution for Ada has to continue to meet the needs of
embedded programmers. For instance, it would need to support fixed, bounded,
and unbounded versions (solely having unbounded strings would not work for
many applications, and indeed not just embedded systems need to restrict
those -- any long-running server has to control dynamic allocation).

>...in other cases their use generates a lot of hidden issues, which are very
>hard to detect.

At least some of which occur because a string is not an array, and the
forcible mapping to them never worked very well. The Z-80 Pascals that we
used to implement the very earliest versions of Ada had more functional
strings than Ada does (by being bounded and using a library for most
operations) - they would have been way easier to extend (as the Python ones
were, as an example).

Randy.
