devel / comp.lang.misc / Re: Storing strings

Re: Storing strings

<js7ov1Fkp2kU1@mid.individual.net>

 by: Charles Lindsey - Sun, 30 Oct 2022 17:01 UTC

On 29/10/2022 18:42, James Harris wrote:

> As above, the language is meant to treat strings as arrays. So AISI it should
> not ascribe any particular meaning to their contents.

In that case you should have a look at Algol68, where strings are arrays of
characters. Every array has an associated descriptor containing its bounds etc.
But more importantly a REF to an array is best implemented by including the
descriptor in the REF (being a strongly typed language, there is no necessity
for REFs to assorted other types to have the same length - they are not
necessarily just address values). This makes it easy to construct slices
(immutable) and REFs to slices (so the slice is mutable), thus providing many of
the features discussed in this thread. However there is no provision for
extending (or shortening) an array, other than to create a new space to copy it
into; one could provide library routines with smart features to avoid actual
copying in some cases, and with a friendly interface which did not expose the
messiness inside.
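
A rough sketch of the idea, in Python rather than Algol 68: a reference that carries its own descriptor (here just a pair of bounds) alongside the shared storage, so that a slice is simply another reference with narrower bounds. The names `ArrayRef`, `lo`, `hi` and `slice` are made up for illustration and are not Algol 68 terminology.

```python
class ArrayRef:
    """A 'fat' reference: shared storage plus a descriptor (the bounds)."""

    def __init__(self, store, lo=0, hi=None):
        self.store = store                          # shared, never copied
        self.lo = lo                                # first index (inclusive)
        self.hi = len(store) if hi is None else hi  # one past the last

    def __len__(self):
        return self.hi - self.lo

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.store[self.lo + i]

    def __setitem__(self, i, value):
        if not 0 <= i < len(self):
            raise IndexError(i)
        self.store[self.lo + i] = value

    def slice(self, lo, hi):
        """Return a new reference to part of the same storage (no copy)."""
        if not 0 <= lo <= hi <= len(self):
            raise IndexError((lo, hi))
        return ArrayRef(self.store, self.lo + lo, self.lo + hi)

    def text(self):
        return "".join(self.store[self.lo:self.hi])


s = ArrayRef(list("abcde"))
t = s.slice(1, 4)      # refers to "bcd" inside s
t[0] = "X"             # writes through to the shared storage
```

Because the bounds travel with the reference, making a slice costs a few words rather than a copy. Growing or shrinking the underlying array is deliberately left out, as in the description above.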

--
Charles H. Lindsey ---------At my New Home, still doing my own thing------
Tel: +44 161 488 1845 Web: https://www.clerew.man.ac.uk
Email: chl@clerew.man.ac.uk Snail-mail: Apt 40, SK8 5BF, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

Re: Storing strings

<tjmcgv$7h72$2@dont-email.me>

 by: James Harris - Sun, 30 Oct 2022 17:33 UTC

On 29/10/2022 22:25, Bart wrote:
> On 29/10/2022 15:01, James Harris wrote:
>> On 27/10/2022 12:14, Bart wrote:
>
>>> Most strings are fixed-length once created; strings that can grow are
>>> rare. You don't need a 'capacity' field for example (like C++'s
>>> Vector type).
>>
>> Having watched some videos on string storage recently I now think I
>> know what you mean by the capacity field - basically that a string
>> descriptor would consist of these fields:
>>
>>    start
>>    length
>>    capacity
>>
>> so that the string could be extended at the end (up to the capacity).
>> That may be a bit restrictive. A programmer might want to remove or
>> add characters at the beginning rather than just at the end, even
>> though such would be done less often.
>
> Doing a prepend is not a problem. What's critical is whether the new
> length is still within the current allocation. (Prepend requires
> shifting of the old string so is less efficient anyway.)

Well, see below.

>
> If a new allocation is needed, you may be copying data for both prepend
> and append.

Yes.

>
> With delete however, you may need to think about whether to /reduce/ the
> allocation size.

Agreed.

>
>> So what do you think of having a string descriptor more like
>>
>>    first
>>    past
>>    memfirst
>>    mempast
>>
>> where memfirst and mempast would define the allocated space in which
>> the string body would sit.
>
> What's the difference between 'first' and 'memfirst'?

memfirst would point to the start of the block in which the string body
existed. (first would point at the same address or a later address.)

> Would you have a
> string that doesn't start at the beginning of its allocated block?

Yes, that would be useful in some cases. For example, if deleting the
first part of a string one wouldn't want to be forced to copy the rest
of it down.

And there are cases where strings may be built by prepending. A classic
example is construction of a network frame. Each layer adds a header
which, naturally, has to go on the front.
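
A minimal sketch of such a descriptor, assuming the body lives inside a larger block with spare room at both ends. Indexes into a `bytearray` stand in for the four pointers (memfirst is index 0 and mempast is the block length); the class name `FrameBuffer` and the header strings are invented for the example.

```python
class FrameBuffer:
    """Bytes held inside a larger block, with spare room at both ends."""

    def __init__(self, capacity, headroom):
        self.mem = bytearray(capacity)   # memfirst..mempast is 0..capacity
        self.first = headroom            # index of the first byte in use
        self.past = headroom             # one past the last byte in use

    def append(self, data):
        if self.past + len(data) > len(self.mem):
            raise MemoryError("no room at the end; would need to reallocate")
        self.mem[self.past:self.past + len(data)] = data
        self.past += len(data)

    def prepend(self, data):
        if len(data) > self.first:
            raise MemoryError("no room at the front; would need to reallocate")
        self.first -= len(data)
        self.mem[self.first:self.first + len(data)] = data

    def drop_front(self, n):
        """Delete the first n bytes by moving 'first'; nothing is copied."""
        self.first += min(n, self.past - self.first)

    def value(self):
        return bytes(self.mem[self.first:self.past])


frame = FrameBuffer(capacity=64, headroom=16)
frame.append(b"payload")
frame.prepend(b"TCP|")     # each layer adds its header on the front
frame.prepend(b"IP|")
frame.prepend(b"ETH|")
```

With enough headroom, neither the prepends nor `drop_front` move the existing bytes; only when the block is exhausted would a reallocation and copy be needed.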

--
James Harris

Re: Storing strings

<tjmd98$7h72$3@dont-email.me>

 by: James Harris - Sun, 30 Oct 2022 17:46 UTC

On 30/10/2022 14:20, Dmitry A. Kazakov wrote:
> On 2022-10-30 12:24, James Harris wrote:

...

>> I don't see where you see a multitude of representations of the null
>> string. AISI the empty string would simply have past equal to first in
>> all cases.
>
>    ...
>    (0..0)
>    (1..1)
>    (2..2)
>    ...
>    (n..n)
>    ...
>
> With pointers it becomes even worse as some of them might point to
> invalid addresses.

In the general case strings would live at arbitrary addresses so no
meaning could be inferred from any address. In all cases

past - first

would define the length of the string. If the length was zero then it
would be an empty string.

That said, if the string could be extended then both past and first
would have to point to allocated memory into which the extension could
take place.

...

>> What is your definition of a slice? Is it /part/ of an underlying
>> string or is it a /copy/ of part of a string? For example, if
>>
>>    string S = "abcde"
>>    slice T = S[1..3]  ;"bcd"
>>
>> then changes to T would do what to S?
>
> No idea. It depends. Is slice in your example an independent object?

A slice, at the moment at least, would be a view of part of a string.
Extending the earlier example,

!_a_!_b_!_c_!_d_!

    ^       ^
    !       !
    s_first s_past

The string in the example would be "abcd" and the slice, delimited by
s_first and s_past, would be the "bc" in the middle of the string. Note
that the contents of the slice would not include the element at s_past.

The slice would appear to be a string with constraints. By default its
contents could be updated in place but it could not be made longer or
shorter.

A callee which wanted only to read a string (a common case) or to update
a string in place should not have to care whether it was passed a string
or a slice. For such a case, strings and slices would be interchangeable.
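
As a sketch of that interchangeability, Python's `memoryview` happens to behave much like the slice described: it is a view of part of a buffer, its contents can be updated in place, and its length cannot be changed. The function name `upper_in_place` is made up for the example, and it assumes plain ASCII bytes.

```python
def upper_in_place(view):
    """Works on anything that supports len(), indexing and item assignment."""
    for i in range(len(view)):
        c = view[i]
        if ord("a") <= c <= ord("z"):
            view[i] = c - 32


s = bytearray(b"abcd")
whole = memoryview(s)
t = whole[1:3]            # a view of "bc"; s_first = 1, s_past = 3

upper_in_place(t)         # callee neither knows nor cares that it got a slice

try:
    t[0:2] = b"x"         # changing the slice's length is refused
    resized = True
except ValueError:
    resized = False
```

The same `upper_in_place` can be handed the whole `bytearray` instead of the view, which is the "should not have to care" property in miniature.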

>
> But considering this:
>
> declare
>    S : String := "abcde";
> begin
>    S (1..3) := "x"; -- Illegal in Ada

In Ada would the following be legal?

S (1..3) := "xxx"; --replacement same size as what it is replacing

I'd be happy with that.

>
> But should it be legal, then the result would be
>
>   "xde"
>
> Many implementations make this illegal because it would require either
> bounded or dynamically allocated unbounded string.
>
> You can consider make it legal for these, but then you would have
> different semantics of slices for different strings. And this would
> contradict the design principle of having all strings interchangeable
> regardless the implementation method.

I don't mind there being differences along the lines of 'constraints':
an object can be passed to a callee which expects an object with the
same constraints or which imposes more, but not to a callee which needs
fewer constraints.

...

>> I presume such constraints would be specified when objects are declared.
>
> Objects and/or subtypes. Depending on the language preferences. Note
> also that you can have constrained views of the same object. E.g. you
> have a mutable variable passed down as in-argument. That would be an
> immutable view of the same object.

Yes, and an immutable object could not be passed to a callee which
wanted a mutable object.
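
A small sketch of the rule, again leaning on `memoryview` as a stand-in for a view with constraints. Python only detects the violation at run time, whereas a language with declared constraints could presumably reject the call at compile time. The function names are invented for the example.

```python
def count_spaces(text):
    """Only reads; accepts mutable and read-only views alike."""
    return sum(1 for b in text if b == ord(" "))


def blank_out(text):
    """Needs a mutable view; a read-only one is rejected at the first write."""
    for i in range(len(text)):
        text[i] = ord(" ")


buf = bytearray(b"a b c")
mutable = memoryview(buf)
readonly = mutable.toreadonly()   # same bytes, with one more constraint

spaces = count_spaces(readonly)   # fine: callee needs no more than it is given

try:
    blank_out(readonly)           # callee needs fewer constraints than this has
    rejected = False
except TypeError:
    rejected = True
```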

>
>> As a programmer how would you want to specify such constraints? Would
>> each have a reserved word, for example?
>
> In some cases constraints might be implied. But usually language have
> lots of [sub]type modifiers like
>
>    in, in out, out, constant
>    atomic, volatile, shared
>    aliased (can get pointers to)
>    external, static
>    public, private, protected (visibility constraints)
>    range, length, bounds
>    parameter AKA discriminant (general purpose constraint)
>    specific type AKA static/dynamic up/downcast (view as another type)
>    class-wide (view as a class of types rooted in this one)
>    ...
>    measurement unit

So you wouldn't have a keyword to indicate a constraint such as
"Non-sliding lower bound" which you mentioned before but IIUC you might
have some qualification of the 'bounds' keyword as in

bounds(^..)

to indicate an unchangeable lower bound (with ^ meaning the start of the
string)?

--
James Harris

Re: Storing strings

<tjmdl9$7h72$4@dont-email.me>

 by: James Harris - Sun, 30 Oct 2022 17:52 UTC

On 29/10/2022 22:02, Bart wrote:
> On 29/10/2022 18:42, James Harris wrote:
>> On 29/10/2022 16:30, Bart wrote:

>> Further, functions which /return/ a string would create the string and
>> return it whole.
>
> Not necessarily. My dynamic language can return a string which is a
> slice into another. (Slices are not exposed in this language; they are
> in the static one, where slices are distinct types.)
>
> Example:
>
>     func trim(s) =
>         if s.len=2 then return "" fi
>         return s[2..$-1]
>     end
>
> This trims the first and last character of string. But here it returns a
> slice into the original string. If I wanted a fresh copy, I'd have to
> use copy() inside the function, or copy() (or a special kind of
> assignment) outside it.

That's a challenging example. In a sense it returns either of two
different types: the caller could be handed a string or a slice.

--
James Harris

Re: Storing strings

<tjmesm$7h72$5@dont-email.me>

 by: James Harris - Sun, 30 Oct 2022 18:13 UTC

On 30/10/2022 16:21, luserdroog wrote:
> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
>> Do you guys have any thoughts on the best ways for strings of characters
>> to be stored?
>>
>> 1. There's the C way, of course, of reserving one value (zero) and using
>> it as a terminator.
>>
>> 2. There's the 'length prefix' option of putting the length of the
>> string in a machine word before the characters.
>>
>> 3. There's the 'double pointer' way of pointing at, say, first and past
>> (where 'past' means first plus length such that the second pointer
>> points one position beyond the last character).
>>
>> Any others?

...

> I think an exhaustive list of options would be very large if you're not
> pre-judging and filtering as you're adding options.
>
> 4) [List|Array|Tuple|Iterator] of character objects

You mean where the characters are stored individually (one per node)?

>
> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
> can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

>
> 6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
> whole byte for metadata attached to each character.

That's definitely thinking outside the box. I can see it working if the
user (the programmer) wanted a string of 24-bit values but it could be
awkward in other cases such as if he wanted a string of 32-bit or 8-bit
values. I don't think I mentioned it but I'd like the programmer to be
able to choose what the elements of the string would be.

--
James Harris

Re: Storing strings

<tjmfp0$uml$1@gioia.aioe.org>

 by: Dmitry A. Kazakov - Sun, 30 Oct 2022 18:28 UTC

On 2022-10-30 18:46, James Harris wrote:
> On 30/10/2022 14:20, Dmitry A. Kazakov wrote:
>> On 2022-10-30 12:24, James Harris wrote:
>
> ..
>
>>> I don't see where you see a multitude of representations of the null
>>> string. AISI the empty string would simply have past equal to first
>>> in all cases.
>>
>>     ...
>>     (0..0)
>>     (1..1)
>>     (2..2)
>>     ...
>>     (n..n)
>>     ...
>>
>> With pointers it becomes even worse as some of them might point to
>> invalid addresses.
>
> In the general case strings would live at arbitrary addresses so no
> meaning could be inferred from any address.

As I said, that is a problem which will preclude some implementations
and make others inefficient.

And in general, using pointers wastes space, as strings are much shorter
than the range a pointer can address.

> In all cases
>
>   past - first
>
> would define the length of the string.

Not really. Usually a negative length is considered equivalent to zero,
e.g. when iterating substrings. Other choices may consider iteration in
reverse, when bounds are inverted.
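
A minimal sketch of the "negative counts as zero" convention, using a half-open range (first inclusive, past exclusive) as in the double-pointer scheme. The helper names are made up. It also shows the earlier point that many different bound pairs all denote the same empty string.

```python
def length(first, past):
    """Length of the range first..past (past exclusive); never negative."""
    return max(0, past - first)


def is_empty(first, past):
    """Inverted bounds are treated the same as equal bounds."""
    return past <= first


empties = [(0, 0), (3, 3), (5, 2)]   # distinct representations, all empty
```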

> If the length was zero then it would be an empty string.

Ergo, a special case to treat in many operations.

>> But considering this:
>>
>> declare
>>     S : String := "abcde";
>> begin
>>     S (1..3) := "x"; -- Illegal in Ada
>
> In Ada would the following be legal?

Yes, in Ada the slice length is constrained just as the string length is.

>   S (1..3) := "xxx";  --replacement same size as what it is replacing
>
> I'd be happy with that.

It is still not fully defined. You need to consider the issue of sliding
bounds. E.g.

S (2..4) (2) := 'x'; -- Assign a character

Now with sliding:

S (2..4) (2) := 'x' gives "abxde", x is second in the slice

without sliding

S (2..4) (2) := 'x' gives "axcde", x is at 2 in the original string

In Ada the right side slides, the left does not. Sliding the right side
allows doing logical things like:

S1 (1..5) := S1 (5..9); -- 5..9 slides to 1..5
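
The two readings can be modelled in a few lines. This is only a sketch of the semantics described above, using 1-based inclusive bounds to match the Ada notation; the function names and the `sliding` flag are invented for the example and are not how any Ada compiler works.

```python
def assign_in_slice(s, lo, hi, index, ch, sliding):
    """Model S (lo..hi) (index) := ch on a 1-based string, returning the result.

    sliding=True:  the slice is renumbered from 1, so index counts within it.
    sliding=False: the slice keeps the original numbering lo..hi.
    """
    pos = lo + index - 1 if sliding else index   # 1-based position in s
    if not lo <= pos <= hi:
        raise IndexError(index)
    chars = list(s)
    chars[pos - 1] = ch
    return "".join(chars)


def assign_slice(s, dst_lo, dst_hi, src_lo, src_hi):
    """Model S (dst_lo..dst_hi) := S (src_lo..src_hi); the right side slides."""
    if dst_hi - dst_lo != src_hi - src_lo:
        raise ValueError("slice lengths differ")
    chars = list(s)
    chars[dst_lo - 1:dst_hi] = chars[src_lo - 1:src_hi]
    return "".join(chars)


with_sliding = assign_in_slice("abcde", 2, 4, 2, "x", sliding=True)
without_sliding = assign_in_slice("abcde", 2, 4, 2, "x", sliding=False)
moved = assign_slice("abcdefghi", 1, 5, 5, 9)
```

In the non-sliding model an index of 1 into the slice 2..4 raises an error, because position 1 lies outside the slice's own bounds.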

>> But should it be legal, then the result would be
>>
>>    "xde"
>>
>> Many implementations make this illegal because it would require either
>> bounded or dynamically allocated unbounded string.
>>
>> You can consider make it legal for these, but then you would have
>> different semantics of slices for different strings. And this would
>> contradict the design principle of having all strings interchangeable
>> regardless the implementation method.
>
> I don't mind there being differences along the lines of 'constraints'
> where a less-constrained object can be passed to a callee which expects
> an object with such constraints or imposes more constraints, but not one
> which needs fewer constraints.

No, the problem is with semantics. E.g. suppose that in a subprogram you do

S (1..100) := ""; -- Remove the first 100 characters

S is a formal parameter. Then, depending on the actual parameter's
subtype it may succeed (for a bounded length string) or fail (for a
fixed length string). Such things are big no-no in language design,
because they become a nightmare for programmers to track.
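
A scaled-down sketch of that hazard, with 1..3 in place of 1..100. The two classes are hypothetical stand-ins for a fixed-length and a bounded-length string; the point is only that the same source line in `strip_prefix` succeeds or fails depending on what the caller passed.

```python
class FixedString:
    """Length is set at creation and can never change."""

    def __init__(self, text):
        self.chars = list(text)

    def replace(self, lo, hi, new):       # 1-based, inclusive bounds
        if len(new) != hi - lo + 1:
            raise ValueError("fixed-length string cannot change length")
        self.chars[lo - 1:hi] = new


class BoundedString:
    """Length may vary up to a maximum."""

    def __init__(self, text, max_len):
        self.chars = list(text)
        self.max_len = max_len

    def replace(self, lo, hi, new):       # 1-based, inclusive bounds
        if len(self.chars) - (hi - lo + 1) + len(new) > self.max_len:
            raise ValueError("would exceed the maximum length")
        self.chars[lo - 1:hi] = new


def strip_prefix(s):
    s.replace(1, 3, "")    # same source text for every kind of string
```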

>>> I presume such constraints would be specified when objects are declared.
>>
>> Objects and/or subtypes. Depending on the language preferences. Note
>> also that you can have constrained views of the same object. E.g. you
>> have a mutable variable passed down as in-argument. That would be an
>> immutable view of the same object.
>
> Yes, and an immutable object could not be passed to a callee which
> wanted a mutable object.

Yes, lifting a constraint is not possible in most cases. However,
dynamic cast is a counterexample. You can move the view up the
inheritance tree. But this is frowned upon since it enforces certain
implementations.

>>> As a programmer how would you want to specify such constraints? Would
>>> each have a reserved word, for example?
>>
>> In some cases constraints might be implied. But usually language have
>> lots of [sub]type modifiers like
>>
>>     in, in out, out, constant
>>     atomic, volatile, shared
>>     aliased (can get pointers to)
>>     external, static
>>     public, private, protected (visibility constraints)
>>     range, length, bounds
>>     parameter AKA discriminant (general purpose constraint)
>>     specific type AKA static/dynamic up/downcast (view as another type)
>>     class-wide (view as a class of types rooted in this one)
>>     ...
>>     measurement unit
>
> So you wouldn't have a keyword to indicate a constraint such as
> "Non-sliding lower bound" which you mentioned before but IIUC you might
> have some qualification of the 'bounds' keyword as in
>
>   bounds(^..)
>
> to indicate an unchangeable lower bound (with ^ meaning the start of the
> string)?

I am not sure a sliding constraint would be usable. It is a different
issue from constraining bounds because it involves operations like
assignment. And it is not clear how to implement such a constraint
effectively. Most constraints are either static (compile time), or
simple to represent, like bounds or type tags. Sliding might be
implemented as a flag, but then you would have to check it all the time.
Maybe it is not worth having as a choice. And it is unclear what the
unconstrained state would be: sliding or non-sliding? (:-))

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: Storing strings

<tjmubg$ra1$1@gioia.aioe.org>

 by: Bart - Sun, 30 Oct 2022 22:37 UTC

On 30/10/2022 17:52, James Harris wrote:
> On 29/10/2022 22:02, Bart wrote:
>> On 29/10/2022 18:42, James Harris wrote:
>>> On 29/10/2022 16:30, Bart wrote:
>
>>> Further, functions which /return/ a string would create the string
>>> and return it whole.
>>
>> Not necessarily. My dynamic language can return a string which is a
>> slice into another. (Slices are not exposed in this language; they are
>> in the static one, where slices are distinct types.)
>>
>> Example:
>>
>>      func trim(s) =
>>          if s.len=2 then return "" fi
>>          return s[2..$-1]
>>      end
>>
>> This trims the first and last character of string. But here it returns
>> a slice into the original string. If I wanted a fresh copy, I'd have
>> to use copy() inside the function, or copy() (or a special kind of
>> assignment) outside it.
>
> That's a challenging example. In a sense it returns either of two
> different types: the caller could be handed a string or a slice.
>

This language only has a String type, not a Slice. Slicing is an
operation you apply on strings to yield another String object.

(Internally, it has to distinguish between owned strings and slices into
strings owned by other objects, but as I said that aspect is not exposed.)
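
One way that internal distinction could look, sketched in Python. This is a guess at the general shape, not the actual implementation: the class `Str`, its fields and the `owned` flag are all invented. The guard in `trim` is written as `<= 2` here so that strings shorter than two characters are covered too, which is a choice of this sketch.

```python
class Str:
    """One user-visible type; whether it owns its storage is internal."""

    def __init__(self, data, start=0, stop=None, owned=True):
        self._data = data                  # list of characters, possibly shared
        self._start = start
        self._stop = len(data) if stop is None else stop
        self._owned = owned

    @classmethod
    def new(cls, text):
        return cls(list(text))

    def __len__(self):
        return self._stop - self._start

    def slice(self, lo, hi):
        """1-based inclusive slice sharing the parent's storage."""
        return Str(self._data, self._start + lo - 1, self._start + hi,
                   owned=False)

    def copy(self):
        """Fresh, owned storage holding just this string's characters."""
        return Str(self._data[self._start:self._stop])

    def __str__(self):
        return "".join(self._data[self._start:self._stop])


def trim(s):
    if len(s) <= 2:
        return Str.new("")
    return s.slice(2, len(s) - 1)


s = Str.new("(hello)")
t = trim(s)        # a view into s, though the caller just sees a Str
c = t.copy()       # an independent string
```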

Re: Storing strings

<6c9c0f83-c472-458b-b696-93c9fee23919n@googlegroups.com>

 by: luserdroog - Mon, 31 Oct 2022 01:03 UTC

On Sunday, October 30, 2022 at 1:13:44 PM UTC-5, James Harris wrote:
> On 30/10/2022 16:21, luserdroog wrote:
> > On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
> >> Do you guys have any thoughts on the best ways for strings of characters
> >> to be stored?
> >>
> >> 1. There's the C way, of course, of reserving one value (zero) and using
> >> it as a terminator.
> >>
> >> 2. There's the 'length prefix' option of putting the length of the
> >> string in a machine word before the characters.
> >>
> >> 3. There's the 'double pointer' way of pointing at, say, first and past
> >> (where 'past' means first plus length such that the second pointer
> >> points one position beyond the last character).
> >>
> >> Any others?
> ..
> > I think an exhaustive list of options would be very large if you're not
> > pre-judging and filtering as you're adding options.
> >
> > 4) [List|Array|Tuple|Iterator] of character objects
> You mean where the characters are stored individually (one per node)?

Yep. Fat characters, or whatever other scaffolding helps for the operations
you want to support.

> > 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
> > can be used to format the data to squeeze it into 7 bits.
> Interesting idea. It's certainly one I hadn't thought of.

It's used in the Classical Forth Dictionary header for the name field which
can be variable length. Often it's followed by a length byte and you store
the pointer to the length byte, using

(length - *length)

to actually get the pointer to the start.
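
A minimal sketch of option 5 plus the trailing length byte, assuming 7-bit ASCII names and a flat `bytes` object standing in for memory. The layout here is illustrative only and is not claimed to match any particular Forth; the function names and the example name "DUP" are assumptions.

```python
def encode_name(name):
    """7-bit characters, with the top bit set on the last one as a terminator."""
    raw = name.encode("ascii")
    if not raw:
        raise ValueError("an empty name has no last byte to mark")
    return raw[:-1] + bytes([raw[-1] | 0x80])


def decode_name(mem, start):
    """Read from 'start' up to and including the byte whose top bit is set."""
    out = bytearray()
    i = start
    while True:
        b = mem[i]
        out.append(b & 0x7F)
        if b & 0x80:
            return out.decode("ascii")
        i += 1


def name_start(mem, length_at):
    """Given the index of a trailing length byte, find where the name begins."""
    return length_at - mem[length_at]


field = encode_name("DUP")
mem = b"\x00\x00" + field + bytes([len(field)])   # two filler bytes, name, length
length_at = len(mem) - 1                          # "pointer" to the length byte
```

One consequence visible in the sketch: with the terminator carried on the last character, an empty string has no direct representation.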

> > 6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
> > whole byte for metadata attached to each character.
> That's definitely thinking outside the box. I can see it working if the
> user (the programmer) wanted a string of 24-bit values but it could be
> awkward in other cases such as if he wanted a string of 32-bit or 8-bit
> values. I don't think I mentioned it but I'd like the programmer to be
> able to choose what the elements of the string would be.

This is what I was using in my unfinished APL-like language. The 8 bits
of tag meant it was easy for a node to be a character or an integer (25 bit)
or a nested subarray or whatever ... mpfp number... symbol table.

So I didn't really need a string type *per se* because there's an array type
whose data elements are these 32bits of encoded whatevs. A string is
just a 1D array, or maybe an array of arrays or a 2D array padded out with
spaces on each line. You'd read it in or receive it initially as a 1D array probably
from a file. Oh, an element could also be a file.
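
A rough Python sketch of such a tagged 32-bit cell, assuming an 8-bit tag in the top byte and a 24-bit unsigned payload. The tag values and function names are hypothetical, and negative integers are not handled here.

```python
from typing import List, Tuple

TAG_CHAR = 1   # hypothetical tag values, chosen for illustration only
TAG_INT = 2

PAYLOAD_BITS = 24
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def pack_cell(tag: int, payload: int) -> int:
    """Combine an 8-bit tag and a 24-bit payload into one 32-bit value."""
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must fit in 8 bits")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 24 bits")
    return (tag << PAYLOAD_BITS) | payload


def unpack_cell(cell: int) -> Tuple[int, int]:
    """Split a cell back into (tag, payload)."""
    return cell >> PAYLOAD_BITS, cell & PAYLOAD_MASK


def string_to_cells(text: str) -> List[int]:
    """A 'string' is just a 1D array of character-tagged cells."""
    return [pack_cell(TAG_CHAR, ord(ch)) for ch in text]


def cells_to_string(cells: List[int]) -> str:
    """Rebuild text, rejecting any cell that is not tagged as a character."""
    out = []
    for cell in cells:
        tag, payload = unpack_cell(cell)
        if tag != TAG_CHAR:
            raise TypeError("cell is not a character")
        out.append(chr(payload))
    return "".join(out)
```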

Re: Storing strings

<tjour1$hlef$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1876&group=comp.lang.misc#1876
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Mon, 31 Oct 2022 17:58:09 +0100
Organization: A noiseless patient Spider
Lines: 80
Message-ID: <tjour1$hlef$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me>
In-Reply-To: <tjmesm$7h72$5@dont-email.me>
 by: David Brown - Mon, 31 Oct 2022 16:58 UTC

On 30/10/2022 19:13, James Harris wrote:
> On 30/10/2022 16:21, luserdroog wrote:
>> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
>>> Do you guys have any thoughts on the best ways for strings of characters
>>> to be stored?
>>>
>>> 1. There's the C way, of course, of reserving one value (zero) and using
>>> it as a terminator.
>>>
>>> 2. There's the 'length prefix' option of putting the length of the
>>> string in a machine word before the characters.
>>>
>>> 3. There's the 'double pointer' way of pointing at, say, first and past
>>> (where 'past' means first plus length such that the second pointer
>>> points one position beyond the last character).
>>>
>>> Any others?
>
> ..
>
>> I think an exhaustive list of options would be very large if you're not
>> pre-judging and filtering as you're adding options.
>>
>> 4) [List|Array|Tuple|Iterator] of character objects
>
> You mean where the characters are stored individually (one per node)?
>
>>
>> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
>> can be used to format the data to squeeze it into 7 bits.
>
> Interesting idea. It's certainly one I hadn't thought of.

Nor should you - that is a crazy idea. It is massively inefficient, as
well as being inconsistent with everything else.

>
>>
>> 6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
>> whole byte for metadata attached to each character.
>
> That's definitely thinking outside the box. I can see it working if the
> user (the programmer) wanted a string of 24-bit values but it could be
> awkward in other cases such as if he wanted a string of 32-bit or 8-bit
> values. I don't think I mentioned it but I'd like the programmer to be
> able to choose what the elements of the string would be.
>

UCS4 is 31 bit, not 24 bit. Perhaps luserdroog is mixing it up with
UTF-32, which can be covered by 21 bits. (The extra 10 bits in UCS4
have never been anything but 0, but if you refer to a long-outdated
and obsolete standard, you should still do so accurately.)
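
The bit counts can be checked with a few lines of Python. This is only arithmetic on the two code-space limits; the constant and function names are chosen for illustration.

```python
MAX_UNICODE = 0x10FFFF    # highest code point Unicode (and so UTF-32) allows
MAX_UCS4 = 0x7FFFFFFF     # top of the original 31-bit UCS-4 code space

unicode_bits = MAX_UNICODE.bit_length()   # bits needed for any Unicode code point
ucs4_bits = MAX_UCS4.bit_length()         # bits in the old UCS-4 space
spare_bits = ucs4_bits - unicode_bits     # bits that are always zero in practice


def utf32_high_byte(ch: str) -> int:
    """Return the most significant byte of a character's UTF-32 encoding."""
    return ch.encode("utf-32-be")[0]
```

Here `unicode_bits` is 21, `ucs4_bits` is 31 and `spare_bits` is 10, and `utf32_high_byte` returns 0 even for the last code point, `chr(0x10FFFF)`.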

Do not make a new string or character storage system based around
anything obsolete - that includes UTF-7 and UCS4. Making lots of extra
work for yourself to support something that everyone else rejected as
complicated, unnecessary and unused decades ago, is just silly.

There are only two character encodings that make any kind of sense in a
modern language (i.e., anything from this century). UTF-8 and
/possibly/ UTF-32 for internal usage, if you find it more convenient for
the operations you want.

If you are using UTF-32 only for internal usage within the language, and
not for export (external function calls, file IO, etc.), then the high
byte is always unused - and therefore free for metadata if that's what
you want. I'm not convinced it is a good idea, but it's certainly possible.

For any kind of interaction with anything else, UTF-8 is the standard.
There are a few other encodings that haven't died off completely yet,
but they are all on the way out.

I would also recommend treating characters and character strings as
something very different from raw bytes and binary blobs. Users want to
do very different things with them, and many of the useful operations
are completely different. Some languages have made the mistake of
conflating the two concepts - it's difficult to fix once that design
flaw is set into a language.

Re: Storing strings

<tjp0pf$1tdo$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1877&group=comp.lang.misc#1877
From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Mon, 31 Oct 2022 17:31:28 +0000
Organization: Aioe.org NNTP Server
Message-ID: <tjp0pf$1tdo$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
 by: Bart - Mon, 31 Oct 2022 17:31 UTC

On 31/10/2022 16:58, David Brown wrote:
> On 30/10/2022 19:13, James Harris wrote:
>> On 30/10/2022 16:21, luserdroog wrote:
>>> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
>>>> Do you guys have any thoughts on the best ways for strings of
>>>> characters
>>>> to be stored?
>>>>
>>>> 1. There's the C way, of course, of reserving one value (zero) and
>>>> using
>>>> it as a terminator.
>>>>
>>>> 2. There's the 'length prefix' option of putting the length of the
>>>> string in a machine word before the characters.
>>>>
>>>> 3. There's the 'double pointer' way of pointing at, say, first and past
>>>> (where 'past' means first plus length such that the second pointer
>>>> points one position beyond the last character).
>>>>
>>>> Any others?
>>
>> ..
>>
>>> I think an exhaustive list of options would be very large if you're not
>>> pre-judging and filtering as you're adding options.
>>>
>>> 4) [List|Array|Tuple|Iterator] of character objects
>>
>> You mean where the characters are stored individually (one per node)?
>>
>>>
>>> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
>>> can be used to format the data to squeeze it into 7 bits.
>>
>> Interesting idea. It's certainly one I hadn't thought of.
>
> Nor should you - that is a crazy idea.  It is massively inefficient, as
> well as being inconsistent with everything else.

It's a perfectly fine idea - for the 1970s.

(Now the 8th bit is better put to use to represent UTF8.)

Re: Storing strings

<tjp7vp$ib78$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1878&group=comp.lang.misc#1878
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Mon, 31 Oct 2022 20:34:17 +0100
Organization: A noiseless patient Spider
Lines: 58
Message-ID: <tjp7vp$ib78$1@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tjp0pf$1tdo$1@gioia.aioe.org>
In-Reply-To: <tjp0pf$1tdo$1@gioia.aioe.org>
 by: David Brown - Mon, 31 Oct 2022 19:34 UTC

On 31/10/2022 18:31, Bart wrote:
> On 31/10/2022 16:58, David Brown wrote:
>> On 30/10/2022 19:13, James Harris wrote:
>>> On 30/10/2022 16:21, luserdroog wrote:
>>>> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
>>>>> Do you guys have any thoughts on the best ways for strings of
>>>>> characters
>>>>> to be stored?
>>>>>
>>>>> 1. There's the C way, of course, of reserving one value (zero) and
>>>>> using
>>>>> it as a terminator.
>>>>>
>>>>> 2. There's the 'length prefix' option of putting the length of the
>>>>> string in a machine word before the characters.
>>>>>
>>>>> 3. There's the 'double pointer' way of pointing at, say, first and
>>>>> past
>>>>> (where 'past' means first plus length such that the second pointer
>>>>> points one position beyond the last character).
>>>>>
>>>>> Any others?
>>>
>>> ..
>>>
>>>> I think an exhaustive list of options would be very large if you're not
>>>> pre-judging and filtering as you're adding options.
>>>>
>>>> 4) [List|Array|Tuple|Iterator] of character objects
>>>
>>> You mean where the characters are stored individually (one per node)?
>>>
>>>>
>>>> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
>>>> can be used to format the data to squeeze it into 7 bits.
>>>
>>> Interesting idea. It's certainly one I hadn't thought of.
>>
>> Nor should you - that is a crazy idea.  It is massively inefficient,
>> as well as being inconsistent with everything else.
>
> It's a perfectly fine idea - for the 1970s.
>
> (Now the 8th bit is better put to use to represent UTF8.)

Indeed.

UTF-7 was invented in an attempt to make an encoding for Unicode that would
work for email, since some email servers did not handle 8-bit characters
correctly, long ago in the dark ages. It was never formally a Unicode
encoding, and was almost never used in practice.

Using 7-bit characters and the eighth bit for a terminator would be
pretty inefficient - 12.5% wasted space to get a single bit of useful
information per string. Not a good trade-off!
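
The 12.5% figure is simply one flag bit out of every eight. A small Python sketch of the arithmetic, assuming 8-bit storage units and that the eighth bit would otherwise be available for data (`flag_scheme_cost` is a made-up helper name):

```python
def flag_scheme_cost(n_chars: int) -> dict:
    """Storage accounting for a string whose 8th bit is a terminator flag."""
    if n_chars < 1:
        raise ValueError("the scheme needs at least one character")
    total_bits = 8 * n_chars   # one 8-bit unit per character
    flag_bits = n_chars        # one bit per character reserved for the flag
    return {
        "total_bits": total_bits,
        "flag_bits": flag_bits,
        "overhead": flag_bits / total_bits,  # always 1/8
    }
```

For an 80-character line this reserves 80 flag bits, of which only the final one says anything; a zero terminator or a one-byte length would cost 8 bits regardless of length.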

Re: Storing strings

<tk0ns4$1g6c3$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1879&group=comp.lang.misc#1879
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Thu, 3 Nov 2022 15:48:20 +0000
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <tk0ns4$1g6c3$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
In-Reply-To: <tjour1$hlef$2@dont-email.me>
 by: James Harris - Thu, 3 Nov 2022 15:48 UTC

On 31/10/2022 16:58, David Brown wrote:
> On 30/10/2022 19:13, James Harris wrote:
>> On 30/10/2022 16:21, luserdroog wrote:
>>> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

>>>> Do you guys have any thoughts on the best ways for strings of
>>>> characters
>>>> to be stored?

...

>>> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
>>> can be used to format the data to squeeze it into 7 bits.
>>
>> Interesting idea. It's certainly one I hadn't thought of.
>
> Nor should you - that is a crazy idea.  It is massively inefficient, as
> well as being inconsistent with everything else.

The model I have chosen (at least, for now) is to have a string indexed
logically from zero (so indices do not need to be stored) and, for
implementation, delimited by two pointers.

The one downside I am aware of is that it will, at times, require
creation and destruction of a small descriptor. I'll have to see how the
approach works out in practice.
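
To make the idea concrete, here is a minimal Python sketch of such a descriptor, with integer offsets standing in for the two pointers. `StrDesc` and its methods are hypothetical names, not part of the language being designed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrDesc:
    """Small descriptor: a shared buffer plus 'first' and 'past' offsets."""
    buf: bytes
    first: int
    past: int   # one position beyond the last element

    def __len__(self) -> int:
        return self.past - self.first

    def at(self, i: int) -> int:
        """Element at logical index i, counted from zero."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.buf[self.first + i]

    def sub(self, lo: int, hi: int) -> "StrDesc":
        """Descriptor for logical range [lo, hi); no characters are copied."""
        if not 0 <= lo <= hi <= len(self):
            raise IndexError((lo, hi))
        return StrDesc(self.buf, self.first + lo, self.first + hi)

    def to_bytes(self) -> bytes:
        """Materialise the described characters as a fresh bytes object."""
        return self.buf[self.first:self.past]
```

Taking a substring only builds another small descriptor over the same buffer, which is where the creation and destruction cost mentioned above comes from.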

...

> I would also recommend treating characters and character strings as
> something very different from raw bytes and binary blobs.  Users want to
> do very different things with them, and many of the useful operations
> are completely different.  Some languages have made the mistake of
> conflating the two concepts - it's difficult to fix once that design
> flaw is set into a language.

That sounds interesting but I cannot tell what you have in mind.

One could consider strings as having two categories of operation: those
which involve only the memory used by strings such as allocation,
concatenation, insertion, deletion, etc; and those which care about the
contents of a string such as capitalisation, comparison, whitespace
recognition, parsing, etc. Why could the mechanics not apply to raw
bytes and blobs?

--
James Harris

Re: Storing strings

<tk0ocl$1g6c3$3@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1880&group=comp.lang.misc#1880
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Thu, 3 Nov 2022 15:57:09 +0000
Organization: A noiseless patient Spider
Lines: 59
Message-ID: <tk0ocl$1g6c3$3@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
<tjc2kf$2hget$1@dont-email.me> <tjdbvk$1sps$1@gioia.aioe.org>
<tjj2eu$3eqiv$2@dont-email.me> <tjk1ok$71a$1@gioia.aioe.org>
<tjlmth$4kfm$1@dont-email.me> <tjm16h$vrp$1@gioia.aioe.org>
<tjmd98$7h72$3@dont-email.me> <tjmfp0$uml$1@gioia.aioe.org>
In-Reply-To: <tjmfp0$uml$1@gioia.aioe.org>
 by: James Harris - Thu, 3 Nov 2022 15:57 UTC

On 30/10/2022 18:28, Dmitry A. Kazakov wrote:
> On 2022-10-30 18:46, James Harris wrote:

...

>> In Ada would the following be legal?
>
> Yes, in Ada slice length is constrained as the string length is.
>
>>    S (1..3) := "xxx";  --replacement same size as what it is replacing
>>
>> I'd be happy with that.
>
> It is still not fully defined. You need to consider the issue of sliding
> bounds. E.g.
>
>   S (2..4) (2) := 'x'; -- Assign a character
>
> Now with sliding:
>
>   S (2..4) (2) := 'x'  gives "abxde", x is second in the slice
>
> without sliding
>
>   S (2..4) (2) := 'x'  gives "axcde", x is at 2 in the original string

AISI the elements of strings and slices would always be accessed by
offset. That appears to be the 'sliding' model you mention.
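
The two readings can be sketched in Python, assuming 1-based inclusive bounds as in the Ada-style examples quoted above. `assign_in_slice` is a made-up helper that returns a new string; it is not Ada semantics verbatim.

```python
def assign_in_slice(s: str, lo: int, hi: int, idx: int, ch: str,
                    sliding: bool) -> str:
    """Return s with one character replaced through the slice s(lo..hi).

    Bounds are 1-based and inclusive.
    sliding=True:  idx counts from 1 within the slice (access by offset).
    sliding=False: idx is a position in the original string.
    """
    pos = lo + idx - 1 if sliding else idx
    if not lo <= pos <= hi:
        raise IndexError("index falls outside the slice")
    return s[:pos - 1] + ch + s[pos:]


print(assign_in_slice("abcde", 2, 4, 2, "x", sliding=True))   # abxde
print(assign_in_slice("abcde", 2, 4, 2, "x", sliding=False))  # axcde
```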

>
> In Ada the right side slides, the left does not. Sliding the right side
> allows doing logical things like:
>
>   S1 (1..5) := S1 (5..9); -- 5..9 slides to 1..5

I don't see any sliding there, only that chars 5 to 9 are copied over
chars 1 to 5 of the same string.

...

>
> I am not sure if sliding constraint might be usable. It is a different
> issue to constraining bounds because it involves operations like
> assignment. And it is not clear how to implement such a constraint
> effectively. Most constraints are either static (compile time), or
> simple to represent, like bounds or type tags. Sliding might be
> implemented as a flag, but then you will have to check it all the time.
> Maybe not worth having it as a choice. And it is unclear what is the
> unconstrained state, sliding or non-sliding? (:-))
>

Always sliding (if I understand the term)!

--
James Harris

Re: Storing strings

<tk0p5u$1g6c3$4@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1881&group=comp.lang.misc#1881
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Thu, 3 Nov 2022 16:10:38 +0000
Organization: A noiseless patient Spider
Lines: 57
Message-ID: <tk0p5u$1g6c3$4@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
<tjc2kf$2hget$1@dont-email.me> <tjdbvk$1sps$1@gioia.aioe.org>
<tjj2eu$3eqiv$2@dont-email.me> <tjj62d$ap3$1@gioia.aioe.org>
<tjjcjr$3gqcc$3@dont-email.me> <tjjgu5$qnd$1@gioia.aioe.org>
<tjjomr$3jpit$1@dont-email.me> <tjk4dn$1bjf$1@gioia.aioe.org>
<tjmdl9$7h72$4@dont-email.me> <tjmubg$ra1$1@gioia.aioe.org>
In-Reply-To: <tjmubg$ra1$1@gioia.aioe.org>
 by: James Harris - Thu, 3 Nov 2022 16:10 UTC

On 30/10/2022 22:37, Bart wrote:
> On 30/10/2022 17:52, James Harris wrote:
>> On 29/10/2022 22:02, Bart wrote:
>>> On 29/10/2022 18:42, James Harris wrote:
>>>> On 29/10/2022 16:30, Bart wrote:
>>
>>>> Further, functions which /return/ a string would create the string
>>>> and return it whole.
>>>
>>> Not necessarily. My dynamic language can return a string which is a
>>> slice into another. (Slices are not exposed in this language; they
>>> are in the static one, where slices are distinct types.)
>>>
>>> Example:
>>>
>>>      func trim(s) =
>>>          if s.len=2 then return "" fi
>>>          return s[2..$-1]
>>>      end
>>>
>>> This trims the first and last character of string. But here it
>>> returns a slice into the original string. If I wanted a fresh copy,
>>> I'd have to use copy() inside the function, or copy() (or a special
>>> kind of assignment) outside it.
>>
>> That's a challenging example. In a sense it returns either of two
>> different types: the caller could be handed a string or a slice.
>>
>
> In this language, it only has a String type, not a Slice. Slicing is an
> operation you apply on strings to yield another String object.
>
> (Internally, it has to distinguish between owned strings and slices into
> strings owned by other objects, but as I said that aspect is not exposed.)

As most uses of slices would be read-only but some would be read-write,
and there are various potential ways to implement a slice, it might be
sensible for me to do something like:

1. Have the default slice as read-only and the simplest to construct. If
s is a string or another slice then a slice of part of that string could
be constructed as

s(1..4)

2. Use keywords to effect other, more specialised, operations. For example,

s.slice_cow(1..4) ;a copy-on-write slice
s.slice_rw(1..4) ;a slice via which the base string can be changed
s.copy(1..4) ;copy the designated chars to a new string
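
For comparison, the four behaviours can be approximated in Python with `bytearray` and `memoryview`. This is only a sketch: the function and class names are hypothetical, and ranges here are zero-based and half-open rather than the `1..4` form above.

```python
def slice_ro(buf: bytearray, lo: int, hi: int) -> memoryview:
    """Read-only view of buf[lo:hi]; no bytes are copied."""
    return memoryview(buf)[lo:hi].toreadonly()


def slice_rw(buf: bytearray, lo: int, hi: int) -> memoryview:
    """Writable view: stores through it change the base buffer."""
    return memoryview(buf)[lo:hi]


def copy_range(buf: bytearray, lo: int, hi: int) -> bytearray:
    """Independent copy of buf[lo:hi]."""
    return bytearray(buf[lo:hi])


class SliceCow:
    """Copy-on-write slice: shares the base until the first write."""

    def __init__(self, buf: bytearray, lo: int, hi: int):
        self._view = memoryview(buf)[lo:hi].toreadonly()
        self._own = None  # private copy, created lazily on first write

    def __getitem__(self, i: int) -> int:
        return (self._own if self._own is not None else self._view)[i]

    def __setitem__(self, i: int, value: int) -> None:
        if self._own is None:
            self._own = bytearray(self._view)  # detach from the base
        self._own[i] = value
```

Writing through the result of `slice_ro` raises `TypeError`, writing through `slice_rw` alters the base, and writing to a `SliceCow` leaves the base untouched.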

--
James Harris

Re: Storing strings

<tk35qj$1que9$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1882&group=comp.lang.misc#1882
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Fri, 4 Nov 2022 14:58:42 +0100
Organization: A noiseless patient Spider
Lines: 97
Message-ID: <tk35qj$1que9$1@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me>
In-Reply-To: <tk0ns4$1g6c3$2@dont-email.me>
 by: David Brown - Fri, 4 Nov 2022 13:58 UTC

On 03/11/2022 16:48, James Harris wrote:
> On 31/10/2022 16:58, David Brown wrote:
>> On 30/10/2022 19:13, James Harris wrote:
>>> On 30/10/2022 16:21, luserdroog wrote:
>>>> On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
>
>>>>> Do you guys have any thoughts on the best ways for strings of
>>>>> characters
>>>>> to be stored?
>
> ..
>
>>>> 5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
>>>> can be used to format the data to squeeze it into 7 bits.
>>>
>>> Interesting idea. It's certainly one I hadn't thought of.
>>
>> Nor should you - that is a crazy idea.  It is massively inefficient,
>> as well as being inconsistent with everything else.
>
> The model I have chosen (at least, for now) is to have a string indexed
> logically from zero (so indices do not need to be stored) and, for
> implementation, delimited by two pointers.
>
> The one downside I am aware of is that it will, at times, require
> creation and destruction of a small descriptor. I'll have to see how the
> approach works out in practice.

There is no universally ideal way to store a string - every method is
inefficient for some uses and operations. You just have to find
something that is good enough for typical uses in your language, add
support for connecting to external code (such as exporting C-style
strings), and make it possible for users to implement other types when
it makes more sense for the user code.

>
> ..
>
>> I would also recommend treating characters and character strings as
>> something very different from raw bytes and binary blobs.  Users want
>> to do very different things with them, and many of the useful
>> operations are completely different.  Some languages have made the
>> mistake of conflating the two concepts - it's difficult to fix once
>> that design flaw is set into a language.
>
> That sounds interesting but I cannot tell what you have in mind.
>

I mean you should consider a "string" to be a way of holding a sequence
of "character" units which can hold a code unit of UTF-8 (since any
other choice of character encoding is madness).

This should be different from a "byte", which would be an 8-bit unit of
raw memory (ignore the existence of machines that can't address 8-bit
memory units directly - they will never use your language). Arrays of
this type should always be contiguous, and used for raw memory access.
(You may also want larger types - memory16, memory32, memory64, etc., if
that is convenient for efficient usage.)

So when you read a block of data from a file, or send a block to a
network socket, it is an array of bytes - not a string. (You can have
high-level abstractions for a "text file" wrapper that can read and
write strings, but that's not fundamental.) If you have an equivalent
of C's "memcpy" function it should use bytes, not any kind of character
type. If you have something like C's "type-based aliasing rules", then
it is bytes that should have the special exception, not characters.
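
Python 3 happens to draw this line, so it can serve as an illustration (this shows one existing language's behaviour, not a prescription). The variable and helper names are arbitrary.

```python
from typing import Optional

raw = "naïve café".encode("utf-8")   # what a file or socket hands back: bytes
text = raw.decode("utf-8")           # explicit conversion to a string

print(len(raw), len(text))           # 12 10


def try_decode(block: bytes) -> Optional[str]:
    """Decode a block as UTF-8, or return None if it is not valid text."""
    try:
        return block.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

A cut made at an arbitrary byte offset, such as `raw[:3]`, can split a multi-byte character, so `try_decode(raw[:3])` returns `None`; that is one practical reason to keep raw blocks and strings as separate types.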

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or conversion
operations on them.

The concept of "signed char" and "unsigned char" in C is a serious
design flaw. A type designed to hold letters should not have a sign,
and should not be used to hold arbitrary raw, low-level data.

You might also consider not having a character type at all. Python 3
has no character types - "a" is a string, not a character.
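
A short Python 3 illustration of both points: indexing a string yields another string, and arithmetic needs explicit conversions through `ord` and `chr`. `next_char` is a made-up helper.

```python
def next_char(ch: str) -> str:
    """Successor by code point, via explicit conversions to and from int."""
    if len(ch) != 1:
        raise ValueError("expected a one-character string")
    return chr(ord(ch) + 1)


def plus_one_is_rejected(ch: str) -> bool:
    """True if adding an integer directly to the string raises TypeError."""
    try:
        ch + 1  # type: ignore[operator]
    except TypeError:
        return True
    return False


element = "abc"[0]   # indexing a string gives a length-1 string, not a char
```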

> One could consider strings as having two categories of operation: those
> which involve only the memory used by strings such as allocation,
> concatenation, insertion, deletion, etc; and those which care about the
> contents of a string such as capitalisation, comparison, whitespace
> recognition, parsing, etc. Why could the mechanics not apply to raw
> bytes and blobs?
>

Operations on strings are those that are relevant to strings. It makes
no sense to capitalise a raw binary blob. It makes no sense to have
methods for linking or chaining sets of blobs - these are direct handles
into memory, and chaining, allocation, etc., are higher level operations.

Raw binary buffers require nothing more than an address and a size to
describe them - anything more, and it is too high level. (Again,
there's nothing wrong with providing higher level features and
interfaces, but they have to build on the fundamental ones.)

Re: Storing strings

<tk3je2$1muc$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1883&group=comp.lang.misc#1883
From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Fri, 4 Nov 2022 17:50:59 +0000
Organization: Aioe.org NNTP Server
Message-ID: <tk3je2$1muc$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
 by: Bart - Fri, 4 Nov 2022 17:50 UTC

On 04/11/2022 13:58, David Brown wrote:

> Neither "byte" nor "character" should have any kind of arithmetic
> operators - they are not integers.  But you will need cast or conversion
> operations on them.

Bytes are small integers, typically of u8 type.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

> The concept of "signed char" and "unsigned char" in C is a serious
> design flaw.  A type designed to hold letters should not have a sign,
> and should not be used to hold arbitrary raw, low-level data.

Signed and unsigned chars are not so bad; presumably C intended these to
do the job of a 'byte' type for small integers. So it was just a poor
choice of name. (After all there is no separate type in C for bytes
holding character data.)

What's bad is that third kind: a 'plain char' type, which is
incompatible with both signed and unsigned char, even though it
necessarily needs to be one or the other on a specific platform. It
occurs in no other language, and causes problems within FFI APIs.

Re: Storing strings

<tk40ji$21394$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1884&group=comp.lang.misc#1884
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Fri, 4 Nov 2022 21:35:46 +0000
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <tk40ji$21394$1@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org>
In-Reply-To: <tk3je2$1muc$1@gioia.aioe.org>
 by: James Harris - Fri, 4 Nov 2022 21:35 UTC

On 04/11/2022 17:50, Bart wrote:
> On 04/11/2022 13:58, David Brown wrote:
>
>> Neither "byte" nor "character" should have any kind of arithmetic
>> operators - they are not integers.  But you will need cast or
>> conversion operations on them.
>
> Bytes are small integers, typically of u8 type.
>
> I can't see why arithmetic can't be done with them, unless you want a
> purer kind of language where arithmetic is only allowed on signed
> numbers, and bitwise ops only on unsigned numbers, which is usually
> going to be a pain for all concerned.

I think what David means is that arithmetic operations don't apply to
characters (even though some languages permit such operations). For
example, neither

'a' * 5

nor even

'R' + 1

have any meaning over the set of characters. Prohibiting arithmetic on
them could be done but would make classifying and manipulating
characters difficult unless one had a comprehensive set of library
functions such as

is_digit(char)
is_alphanum(locale, char)
is_lower(locale, char)
upper(locale, char)

and many more.
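
As a point of reference, Python's built-in string methods cover this kind of library without character arithmetic. The wrappers below borrow the names from the list above but are only a sketch: Python's methods follow Unicode properties and take no locale argument, so the `locale` parameter is omitted.

```python
def is_digit(ch: str) -> bool:
    """True if ch is a digit character according to Unicode."""
    return ch.isdigit()


def is_alphanum(ch: str) -> bool:
    """True if ch is a letter or a digit."""
    return ch.isalnum()


def is_lower(ch: str) -> bool:
    """True if ch is a lowercase letter."""
    return ch.islower()


def upper(ch: str) -> str:
    """Uppercase form of ch; note the result may be longer than the input."""
    return ch.upper()
```

One wrinkle such a library has to face: `upper("ß")` gives `"SS"`, so case conversion is really a string-to-string operation, not character-to-character.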

--
James Harris

Re: Storing strings

<tk43lf$106j$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1885&group=comp.lang.misc#1885
From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Fri, 4 Nov 2022 22:28:00 +0000
Organization: Aioe.org NNTP Server
Message-ID: <tk43lf$106j$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="32979"; posting-host="uabYU4OOdxBKlV2hpj27FQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.4.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Bart - Fri, 4 Nov 2022 22:28 UTC

On 04/11/2022 21:35, James Harris wrote:
> On 04/11/2022 17:50, Bart wrote:
>> On 04/11/2022 13:58, David Brown wrote:
>>
>>> Neither "byte" nor "character" should have any kind of arithmetic
>>> operators - they are not integers.  But you will need cast or
>>> conversion operations on them.
>>
>> Bytes are small integers, typically of u8 type.
>>
>> I can't see why arithmetic can't be done with them, unless you want a
>> purer kind of language where arithmetic is only allowed on signed
>> numbers, and bitwise ops only on unsigned numbers, which is usually
>> going to be a pain for all concerned.
>
> I think what David means is that arithmetic operations don't apply to
> characters

I was picking on the 'byte' type; it seems extraordinary that you
shouldn't be allowed to do arithmetic with them. If you can initialise a
byte value with a number like this:

byte a = 123

then it's a number!

> (even though some languages permit such operations). For
> example, neither
>
>   'a' * 5
>
> nor even
>
>   'R' + 1
>
> have any meaning over the set of characters.

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet? Why should code like this be made illegal:

a := a * 10 + (c - '0')

Then I realised I shouldn't be telling the programmer what they can and
can't do with characters, as there might be some perfectly valid
use-case that I simply hadn't thought of.

Maybe 'a' * 5 yields the value 'aaaaa' or the string "aaaaa", or this is
some kind of encryption algorithm.
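
For comparison, Python takes roughly this position for its one-character strings: multiplying by an integer repeats the string, while adding an integer is rejected and has to go through explicit conversions. A small sketch (the variable names are only illustrative):

```python
# In Python a "character" is simply a string of length 1.
repeated = 'a' * 5                  # sequence repetition, not arithmetic on a code

try:
    'R' + 1                         # mixing str and int is rejected
    plus_allowed = True
except TypeError:
    plus_allowed = False

# The explicit route: convert to a code point, add, convert back.
next_by_codepoint = chr(ord('R') + 1)

print(repeated, plus_allowed, next_by_codepoint)
```

Whether that split (repetition allowed, addition refused) is the right one is exactly the design question being argued here.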

So now they are treated like integers, except that printing an array of
char or a pointer to char assumes they are strings.

> Prohibiting arithmetic on
> them could be done but would make classifying and manipulating
> characters difficult unless one had a comprehensive set of library
> functions such as
>
>   is_digit(char)
>   is_alphanum(locale, char)
>   is_lower(locale, char)
>   upper(locale, char)
>
> and many more.

As I said, you and I don't know all the possibilities. Of course there
would need to be conversions between char and int, but this can become a
nuisance.

Re: Storing strings

<tk5ebh$2dc61$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1886&group=comp.lang.misc#1886

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 10:36:33 +0000
Organization: A noiseless patient Spider
Lines: 109
Message-ID: <tk5ebh$2dc61$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Nov 2022 10:36:33 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="dbc9f1e964fd5b264ade5423501cf343";
logging-data="2535617"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19jQNYYwseaCJSMtYeIU4hbazDo5O0WLJI="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:S6TVWRIo32PuiO6d0Nw6Dkx7I6o=
Content-Language: en-GB
In-Reply-To: <tk43lf$106j$1@gioia.aioe.org>
 by: James Harris - Sat, 5 Nov 2022 10:36 UTC

On 04/11/2022 22:28, Bart wrote:
> On 04/11/2022 21:35, James Harris wrote:
>> On 04/11/2022 17:50, Bart wrote:
>>> On 04/11/2022 13:58, David Brown wrote:
>>>
>>>> Neither "byte" nor "character" should have any kind of arithmetic
>>>> operators - they are not integers.  But you will need cast or
>>>> conversion operations on them.
>>>
>>> Bytes are small integers, typically of u8 type.
>>>
>>> I can't see why arithmetic can't be done with them, unless you want a
>>> purer kind of language where arithmetic is only allowed on signed
>>> numbers, and bitwise ops only on unsigned numbers, which is usually
>>> going to be a pain for all concerned.
>>
>> I think what David means is that arithmetic operations don't apply to
>> characters
>
> I was picking on the 'byte' type; it seems extraordinary that you
> shouldn't be allowed to do arithmetic with them. If you can initialise a
> byte value with a number like this:
>
>     byte a = 123
>
> then it's a number!
>
>> (even though some languages permit such operations). For example, neither
>>
>>    'a' * 5
>>
>> nor even
>>
>>    'R' + 1
>>
>> have any meaning over the set of characters.
>
> I actually had such a restriction for a while: char*5 wasn't allowed,
> but char+1 was. After all why on earth shouldn't you want the next
> character in that alphabet?

That's because 'R' + 1 may not be the next character in all alphabets.
Defining 'next' is more than difficult. It depends on intended collation
order which varies in different parts of the world and can even change
over time as authorities choose different collation orders. Some
plausible meanings of 'R' + 1:

'S' (as in ASCII)
'r' (what the user may want as sort order)
's' (what the user may want as sort order)
a non-character (as in EBCDIC)

Perhaps a pseudo-call would be better such as

char_plus(collation, 'R', 1)

where 'collation' would be used to determine what was the specified
number of characters away from 'R'.
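
A minimal Python sketch of such a pseudo-call, assuming a collation can be modelled as a plain ordered string of characters. The two collation tables are hypothetical and cover only unaccented Latin letters:

```python
def char_plus(collation, ch, n):
    """Return the character n places after ch in the ordered string `collation`."""
    target = collation.index(ch) + n        # ValueError if ch is not in the collation
    if not 0 <= target < len(collation):
        raise IndexError("no character %d places from %r" % (n, ch))
    return collation[target]

# Two hypothetical collations over the same letters.
UPPER_ONLY = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
CASE_PAIRED = "".join(u + u.lower() for u in UPPER_ONLY)    # "AaBbCc..."

print(char_plus(UPPER_ONLY, 'R', 1))     # S
print(char_plus(CASE_PAIRED, 'R', 1))    # r
```

The same call gives different answers purely because the collation argument differs, which is the point of passing it explicitly.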

> Why should code like this be made illegal:
>
>     a := a * 10 + (c - '0')

I'm not saying it should. I am in two minds about what's best. The
alternative is something like

a := a * 10 + digit_value(c)
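
As a sketch of what digit_value might look like, here is a Python version. The digit table, the base parameter and the parse_number helper are assumptions added for illustration:

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def digit_value(c, base=10):
    """Return the numeric value of the digit character c in the given base (2..36)."""
    value = DIGITS.find(c.lower()) if len(c) == 1 else -1
    if value < 0 or value >= base:
        raise ValueError("%r is not a digit in base %d" % (c, base))
    return value

def parse_number(text, base=10):
    """Accumulate digits left to right, as in  a := a * 10 + digit_value(c)."""
    a = 0
    for c in text:
        a = a * base + digit_value(c, base)
    return a

print(parse_number("1234"), parse_number("ff", 16))
```

Unlike c - '0', this form rejects non-digits instead of silently producing a meaningless number.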

>
> Then I realised I shouldn't be telling the programmer what they can and
> can't do with characters, as there might be some perfectly valid
> use-case that I simply hadn't thought of.

Agreed. Although the point of prohibiting arithmetic on characters is to
make multilingual programming easier, not harder.

...

>> Prohibiting arithmetic on them could be done but would make
>> classifying and manipulating characters difficult unless one had a
>> comprehensive set of library functions such as
>>
>>    is_digit(char)
>>    is_alphanum(locale, char)
>>    is_lower(locale, char)
>>    upper(locale, char)
>>
>> and many more.
>
> As I said, you and I don't know all the possibilities.

Yes, the challenge of making multilingual programming easier is
providing a comprehensive and convenient set of conversions.

> Of course there
> would need to be conversions between char and int, but this can become a
> nuisance.

Aside from converting digits (and any other characters used as digits in
a higher number base), is there any meaning to converting chars to/from
ints?

--
James Harris

Re: Storing strings

<tk5ghp$2ds0a$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1887&group=comp.lang.misc#1887

Path: i2pn2.org!i2pn.org!paganini.bofh.team!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 11:14:01 +0000
Organization: A noiseless patient Spider
Lines: 105
Message-ID: <tk5ghp$2ds0a$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Nov 2022 11:14:01 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="dbc9f1e964fd5b264ade5423501cf343";
logging-data="2551818"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/MRd/hJ5oqKfwCh52ePVFjewHQHa1WkDc="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:TCx883iAuvOr22Y7Ry2EIFthuvI=
In-Reply-To: <tk35qj$1que9$1@dont-email.me>
Content-Language: en-GB
 by: James Harris - Sat, 5 Nov 2022 11:14 UTC

On 04/11/2022 13:58, David Brown wrote:
> On 03/11/2022 16:48, James Harris wrote:
>> On 31/10/2022 16:58, David Brown wrote:

...

>>> I would also recommend treating characters and character strings as
>>> something very different from raw bytes and binary blobs.  Users want
>>> to do very different things with them, and many of the useful
>>> operations are completely different.  Some languages have made the
>>> mistake of conflating the two concepts - it's difficult to fix once
>>> that design flaw is set into a language.
>>
>> That sounds interesting but I cannot tell what you have in mind.
>>
>
> I mean you should consider a "string" to be a way of holding a sequence
> of "character" units which can hold a code unit of UTF-8 (since any
> other choice of character encoding is madness).

We've discussed before that (IMO) Unicode is useful for physical
printing to paper or electronic rendering such as to PDF but that it's a
nightmare for programmers and users when it is used for any kind of
input so I won't go over that again except to say that AISI Unicode
should be handled by library functions rather than a language.

What I do have in mind is strings of 'containers' where a string might
be declared as of type

string of char8 -- meaning a string of char8 containers
string of char32 -- meaning a string of char32 containers

What goes in each 8-bit or 32-bit 'container' would be another matter.

That agnostic ideal is somewhat in tension with the desire to include
string literals in a program text. For that, as I've mentioned before,
my preference is to have the program text and any literals within it
written in ASCII and American English; supplementary files would express
the string literals in other languages.

For example,

print "Hello world"

would be accompanied by a file for French which included

"Hello world" --> "Bonjour le monde"

Naturally, multilingual programming is much more complex than that
simple example but it shows the basic idea. The compiler would be able
to check that language files had everything required for a given piece
of source code.
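
The completeness check could be sketched in Python as below, assuming a language file can be loaded into a simple mapping from source literal to translation. The function names and the sample catalog are hypothetical:

```python
def missing_translations(source_literals, catalog):
    """Return the source literals that have no entry in a language catalog."""
    return [text for text in source_literals if text not in catalog]

def translate(text, catalog):
    """Look up a literal, falling back to the source text if it is missing."""
    return catalog.get(text, text)

# Literals as they appear in the (ASCII, American English) program text.
source_literals = ["Hello world", "Goodbye"]

# Contents of a hypothetical French language file.
french = {"Hello world": "Bonjour le monde"}

print(missing_translations(source_literals, french))   # literals still to translate
print(translate("Hello world", french))
```

A compiler doing this check would report the missing entries at build time rather than falling back at run time.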

...

> Neither "byte" nor "character" should have any kind of arithmetic
> operators - they are not integers.  But you will need cast or conversion
> operations on them.

The char8 and char32 containers could omit support for arithmetic **if**
enough support routines were provided. But, as Bart says, it's difficult
to anticipate all such support routines that programmers might need.

>
> The concept of "signed char" and "unsigned char" in C is a serious
> design flaw.  A type designed to hold letters should not have a sign,
> and should not be used to hold arbitrary raw, low-level data.

OK.

> You might also consider not having a character type at all.  Python 3
> has no character types - "a" is a string, not a character.

I can see that as being possible. I keep coming across examples of where
a 1-character string would do just as well as a character. ATM, though,
I have a separate character type.

...

> Raw binary buffers require nothing more than an address and a size to
> describe them - anything more, and it is too high level.  (Again,
> there's nothing wrong with providing higher level features and
> interfaces, but they have to build on the fundamental ones.)

Maybe that's where I am going with this. What you describe does sound
rather like my idea of using chars as containers and putting char and
string handling in libraries.

Incidentally, AISI I could have strings of any data type, not just
chars. For example,

string of int16
string of struct {x: float, y: float}
string of function(int, bool) -> uint

I'll have to see how it works out in practice but the idea is to
separate the concept of a string (basically, storage layout) from the
concept of whatever type of element the string is made from.
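
Python's standard array module shows a similar separation for numeric elements: one contiguous layout, parameterised by an element type code. This is only an analogy for the idea, not the proposed language feature, and the variable names are illustrative:

```python
from array import array

# 'B' is an unsigned 8-bit element; 'h' is a C short (16 bits on common
# platforms); 'd' is a C double.
string_of_char8 = array('B', b"Hello")
string_of_int16 = array('h', [1, -2, 300])
string_of_float = array('d', [0.5, 1.5])

def layout(s):
    """Describe the storage: element count, element size, total bytes."""
    return len(s), s.itemsize, len(s.tobytes())

for s in (string_of_char8, string_of_int16, string_of_float):
    print(s.typecode, layout(s))
```

Strings of structs or of functions have no counterpart in this module, so it illustrates the layout half of the idea only.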

--
James Harris

Re: Storing strings

<tk5p07$17hl$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1888&group=comp.lang.misc#1888

Path: i2pn2.org!i2pn.org!aioe.org!uabYU4OOdxBKlV2hpj27FQ.user.46.165.242.75.POSTED!not-for-mail
From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 13:38:16 +0000
Organization: Aioe.org NNTP Server
Message-ID: <tk5p07$17hl$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org> <tk5ebh$2dc61$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="40501"; posting-host="uabYU4OOdxBKlV2hpj27FQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.4.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Bart - Sat, 5 Nov 2022 13:38 UTC

On 05/11/2022 10:36, James Harris wrote:
> On 04/11/2022 22:28, Bart wrote:

>> I actually had such a restriction for a while: char*5 wasn't allowed,
>> but char+1 was. After all why on earth shouldn't you want the next
>> character in that alphabet?
>
> That's because 'R' + 1 may not be the next character in all alphabets.
> Defining 'next' is more than difficult. It depends on intended collation
> order which varies in different parts of the world and can even change
> over time as authorities choose different collation orders. Some
> plausible meanings of 'R' + 1:
>
>   'S' (as in ASCII)

You said elsewhere that you want to use ASCII within programs, which, as
it happens, corresponds to the first 128 code points of Unicode. Here:

char c
for c in 'A'..'Z' do
    print c
od

this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by adding
+1 to 'c'; it doesn't care about collating order!

(This also illustrates the difference between `byte` and `char`; using a
byte type, the output would be '656667...90'.)
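
The same difference can be reproduced in Python, where the distinction is made at the point of conversion rather than by the element type. This is a sketch with illustrative names:

```python
letters = range(ord('A'), ord('Z') + 1)                # code points 65..90

as_chars = "".join(chr(code) for code in letters)      # each element shown as a character
as_bytes = "".join(str(code) for code in letters)      # each element shown as a number

print(as_chars)
print(as_bytes)
```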

>   'r' (what the user may want as sort order)
>   's' (what the user may want as sort order)
>   a non-character (as in EBCDIC)

My feeling is that it is these diverse requirements that require
user-supplied functions.

My 'char' is still a thinly veiled numeric type, so ordinary integer
arithmetic can be used. Otherwise even something like this becomes
impossible:

['A'..'Z']int histogram

midpoint := (histogram.upb - histogram.lwb)/2

++histogram[midpoint+1]

This requires certain properties of array indices, like being able to
do arithmetic, as well as being consecutive ordinal values.

> Perhaps a pseudo-call would be better such as
>
>   char_plus(collation, 'R', 1)
>
> where 'collation' would be used to determine what was the specified
> number of characters away from 'R'.

Sure, as I said, you can provide any interpretation you like. But if you
do C+1, you expect to get the code of the next character (or next
codepoint if venturing outside ASCII).

>
> Aside from converting digits (and any other characters used as digits in
> a higher number base) is there any meaning to converting chars to/from
> ints?

My static language makes byte and char slightly different types. (Types
involving char may get printed differently.)

That meant that `ref byte` and `ref char` were incompatible, which
rapidly turned into a nightmare: I might have a readfile() routine that
returned a `ref byte` type, a pointer to a block of memory.

But I wanted to interpret that block as `ref char` - a string. So this
meant loads of casts to either `ref byte` or `ref char` to get things to
work, but it got too much (a bit like 'const poisoning' in C where it
just propagates everywhere). That was clearly the wrong approach.

In the end I relaxed the type rules so that `ref byte` and `ref char`
are compatible, and everything is now SO much simpler.
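
For contrast, Python keeps the two types apart but asks for a single explicit conversion at the boundary, instead of casts scattered through the code. A sketch, assuming the file contents are UTF-8 (the sample bytes are made up):

```python
data = b"caf\xc3\xa9"            # raw bytes, as a file-reading routine would return them

text = data.decode("utf-8")      # one explicit step from bytes to characters

print(len(data), len(text))      # byte count and character count differ
print(text.encode("utf-8") == data)
```

After the decode, everything downstream works with characters only, so the conversion does not propagate.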

Re: Storing strings

<tk5pfp$1dsr$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1889&group=comp.lang.misc#1889

Path: i2pn2.org!i2pn.org!aioe.org!uabYU4OOdxBKlV2hpj27FQ.user.46.165.242.75.POSTED!not-for-mail
From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 13:46:33 +0000
Organization: Aioe.org NNTP Server
Message-ID: <tk5pfp$1dsr$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk5ghp$2ds0a$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="47003"; posting-host="uabYU4OOdxBKlV2hpj27FQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.4.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Bart - Sat, 5 Nov 2022 13:46 UTC

On 05/11/2022 11:14, James Harris wrote:
> On 04/11/2022 13:58, David Brown wrote:

>> I mean you should consider a "string" to be a way of holding a
>> sequence of "character" units which can hold a code unit of UTF-8
>> (since any other choice of character encoding is madness).
>
> We've discussed before that (IMO) Unicode is useful for physical
> printing to paper or electronic rendering such as to PDF but that it's a
> nightmare for programmers and users when it is used for any kind of
> input so I won't go over that again except to say that AISI Unicode
> should be handled by library functions rather than a language.
>
> What I do have in mind is strings of 'containers' where a string might
> be declared as of type
>
>   string of char8      -- meaning a string of char8 containers
>   string of char32     -- meaning a string of char32 containers
>
> What goes in each 8-bit or 32-bit 'container' would be another matter.
>
> That agnostic ideal is somewhat in tension with the desire to include
> string literals in a program text. For that, as I've mentioned before,
> my preference is to have the program text and any literals within it
> written in ASCII and American English; supplementary files would express
> the string literals in other languages.
>
> For example,
>
>   print "Hello world"
>
> would be accompanied by a file for French which included
>
>   "Hello world" --> "Bonjour le monde"
>
> Naturally, multilingual programming is much more complex than that
> simple example but it shows the basic idea. The compiler would be able
> to check that language files had everything required for a given piece
> of source code.

Is it? This is pretty much all I did when I used to write internationalised
applications. Although that was only done for French, German and Dutch.

But that print example would be written like this:

print /"Hello World"

The "/" was a translation operator, so only certain strings were
translated. This also made it easy to scan source code to build a list
of messages, used to maintain the dictionary as entries were added,
deleted or modified.

The scheme did need some hints sometimes, written like this, to get
around ambiguities:

print /"Green!colour"
print /"Green!fresh"

The hint was usually filtered out.
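
A rough Python sketch of the scanning step described above. The regular expression, the function name and the assumption that "!" never occurs in the message text itself are all illustrative, not the original tooling:

```python
import re

# Matches /"..." ; string literals without the leading slash are ignored.
MARKED = re.compile(r'/"([^"]*)"')

def collect_messages(source):
    """Return (text, hint) pairs for every marked literal in the source."""
    pairs = []
    for literal in MARKED.findall(source):
        text, _, hint = literal.partition("!")
        pairs.append((text, hint or None))      # hint is None when absent
    return pairs

source = '''
print /"Hello World"
print "not translated"
print /"Green!colour"
print /"Green!fresh"
'''

for pair in collect_messages(source):
    print(pair)
```

The (text, hint) pair serves as the dictionary key, while only the text part is ever displayed.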

But this is little to do with how strings are represented. Even in
English, messages may include characters like "£" (pound sign) which is
not part of ASCII.

So a way to represent Unicode within literals is still needed (didn't we
discuss this a couple of years ago?).

Re: Storing strings

<tk61o2$2hp0g$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1891&group=comp.lang.misc#1891

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 16:07:30 +0000
Organization: A noiseless patient Spider
Lines: 138
Message-ID: <tk61o2$2hp0g$1@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org> <tk5ebh$2dc61$2@dont-email.me>
<tk5p07$17hl$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Nov 2022 16:07:30 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="dbc9f1e964fd5b264ade5423501cf343";
logging-data="2679824"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Qb4kgCp8UKFQHO8+hM+i8mNwKYVWQuMs="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:kAEl00DmNnupCptYGGZx2NCC7Og=
In-Reply-To: <tk5p07$17hl$1@gioia.aioe.org>
Content-Language: en-GB
 by: James Harris - Sat, 5 Nov 2022 16:07 UTC

On 05/11/2022 13:38, Bart wrote:
> On 05/11/2022 10:36, James Harris wrote:
>> On 04/11/2022 22:28, Bart wrote:
>
>>> I actually had such a restriction for a while: char*5 wasn't allowed,
>>> but char+1 was. After all why on earth shouldn't you want the next
>>> character in that alphabet?
>>
>> That's because 'R' + 1 may not be the next character in all alphabets.
>> Defining 'next' is more than difficult. It depends on intended
>> collation order which varies in different parts of the world and can
>> even change over time as authorities choose different collation
>> orders. Some plausible meanings of 'R' + 1:
>>
>>    'S' (as in ASCII)
>
> You said elsewhere that you want to use ASCII within programs, which, as
> it happens, corresponds to the first 128 code points of Unicode. Here:
>
>     char c
>     for c in 'A'..'Z' do
>         print c
>     od
>
> this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by adding
> +1 to 'c'; it doesn't care about collating order!

That piece of code is fine for an English-speaking user but to a
Spaniard the alphabet has a character missing, and Greeks wouldn't agree
with it at all.

Where L is the locale why not allow something like

for c in L.alpha_first..L.alpha_last
    print c
od

?

That should work in English or Spanish or Greek etc, shouldn't it?
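
A sketch of how that might look in Python, assuming each locale simply supplies its alphabet as an ordered string. The table and the locale keys are hypothetical:

```python
# Hypothetical per-locale alphabets, listed in each locale's own order.
ALPHABETS = {
    "en": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "es": "ABCDEFGHIJKLMNÑOPQRSTUVWXYZ",
    # Greek capitals U+0391..U+03A9; U+03A2 is unassigned, so it is skipped.
    "el": "".join(chr(cp) for cp in range(0x391, 0x3AA) if cp != 0x3A2),
}

def letters(locale):
    """Yield the letters from alpha_first to alpha_last for a locale."""
    for c in ALPHABETS[locale]:
        yield c

for locale in ("en", "es", "el"):
    print("".join(letters(locale)))
```

Note that even within Greek, stepping by code point from U+03A1 lands on an unassigned position, so the loop has to follow the alphabet rather than the codes.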

>
> (This also illustrates the difference between `byte` and `char`; using a
> byte type, the output would be '656667...90'.)
>
>
>
>>    'r' (what the user may want as sort order)
>>    's' (what the user may want as sort order)
>>    a non-character (as in EBCDIC)
>
> My feeling is that it is these diverse requirements that require
> user-supplied functions.

Functions, yes, (or what appear to be functions) though surely they
should be part of a library that comes with the language.

>
> My 'char' is still a thinly veiled numeric type, so ordinary integer
> arithmetic can be used. Otherwise even something like this becomes
> impossible:
>
>     ['A'..'Z']int histogram
>
>     midpoint := (histogram.upb - histogram.lwb)/2
>
>     ++histogram[midpoint+1]
>
> This requires certain properties of array indices, like being able to
> do arithmetic, as well as being consecutive ordinal values.

Why not

[L.alpha_first..L.alpha_last] int histogram

?

As for the calculations what about using L.ord and L.chr to convert
between chars and integers?
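
One possible reading of L.ord and L.chr, sketched in Python: they count positions within the locale's alphabet rather than code points. The Locale class and its fields are assumptions made for this illustration:

```python
class Locale:
    """Hypothetical locale whose ord/chr count positions in its alphabet."""

    def __init__(self, alphabet):
        self.alphabet = alphabet
        self.alpha_first = alphabet[0]
        self.alpha_last = alphabet[-1]

    def ord(self, c):
        return self.alphabet.index(c)       # 0 for alpha_first

    def chr(self, n):
        return self.alphabet[n]

L = Locale("ABCDEFGHIJKLMNÑOPQRSTUVWXYZ")    # Spanish, 27 letters

# Histogram indexed by the locale's letters; the arithmetic is done on positions.
histogram = {c: 0 for c in L.alphabet}
midpoint = (L.ord(L.alpha_last) - L.ord(L.alpha_first)) // 2
histogram[L.chr(midpoint + 1)] += 1

print(midpoint, L.chr(midpoint + 1))
```

The integer arithmetic survives intact; only the mapping between integers and characters is delegated to the locale.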

>
>> Perhaps a pseudo-call would be better such as
>>
>>    char_plus(collation, 'R', 1)
>>
>> where 'collation' would be used to determine what was the specified
>> number of characters away from 'R'.
>
> Sure, as I said, you can provide any interpretation you like. But if you
> do C+1, you expect to get the code of the next character (or next
> codepoint if venturing outside ASCII).

If you use codepoints then you might not get the next character in
sequence - as in the case of 'R' in EBCDIC (you'd get a non-printing
character) or 'N' in Spanish (you'd get 'O' rather than the 'Ñ', N with
a tilde, that a Spaniard would expect).

If the programmer wants "the next character in the alphabet" then
shouldn't the programming language or a standard library help him get
that irrespective of the human language the program is meant to be
processing?

>
>
>>
>> Aside from converting digits (and any other characters used as digits
>> in a higher number base) is there any meaning to converting chars
>> to/from ints?
>
>
> My static language makes byte and char slightly different types. (Types
> involving char may get printed differently.)
>
> That meant that `ref byte` and `ref char` were incompatible, which
> rapidly turned into a nightmare: I might have a readfile() routine that
> returned a `ref byte` type, a pointer to a block of memory.
>
> But I wanted to interpret that block as `ref char` - a string. So this
> meant loads of casts to either `ref byte` or `ref char` to get things to
> work, but it got too much (a bit like 'const poisoning' in C where it
> just propagates everywhere). That was clearly the wrong approach.

C's const propagation sounds like Java with its horrible, and sticky,
exception propagation.

>
> In the end I relaxed the type rules so that `ref byte` and `ref char`
> are compatible, and everything is now SO much simpler.

Would there have been any value in defining a layout for the untyped
area of bytes (or parts thereof)? That's where I think I am headed.

--
James Harris

Re: Storing strings

<tk62d9$2hp0g$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1892&group=comp.lang.misc#1892

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 16:18:49 +0000
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <tk62d9$2hp0g$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk5ghp$2ds0a$2@dont-email.me> <tk5pfp$1dsr$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Nov 2022 16:18:49 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="dbc9f1e964fd5b264ade5423501cf343";
logging-data="2679824"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/UjsPqtpEW24/iCK2uBuiiPlhEi5JgdyU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:Ta8fc5ooMfXOCU1qZSqpTmCAX5I=
In-Reply-To: <tk5pfp$1dsr$1@gioia.aioe.org>
Content-Language: en-GB
 by: James Harris - Sat, 5 Nov 2022 16:18 UTC

On 05/11/2022 13:46, Bart wrote:
> On 05/11/2022 11:14, James Harris wrote:

...

>> For example,
>>
>>    print "Hello world"
>>
>> would be accompanied by a file for French which included
>>
>>    "Hello world" --> "Bonjour le monde"
>>
>> Naturally, multilingual programming is much more complex than that
>> simple example but it shows the basic idea. The compiler would be able
>> to check that language files had everything required for a given piece
>> of source code.
>
> Is it? This is pretty much all I did when I used to write internationalised
> applications. Although that was only done for French, German and Dutch.

I imagine multilingual programming would be very difficult and that it's
something a language should help with, but it's not something I have had
to do yet.

>
> But that print example would be written like this:
>
>     print /"Hello World"
>
> The "/" was a translation operator, so only certain strings were
> translated. This also made it easy to scan source code to build a list
> of messages, used to maintain the dictionary as entries were added,
> deleted or modified.
>
> The scheme did need some hints sometimes, written like this, to get
> around ambiguities:
>
>     print /"Green!colour"
>     print /"Green!fresh"

That's cool. I have something similar, trailing identifiers, though I
had them down as specific identifiers rather than hints. For example,

print "Green" :GreenColor
print "Green" :GreenFresh

Then the accompanying language files could distinguish between the strings.

>
> The hint was usually filtered out.
>
> But this is little to do with how strings are represented. Even in
> English, messages may include characters like "£" (pound sign) which is
> not part of ASCII.

I have two ways to deal with that. If the program is to use a pound sign
in England but a dollar sign in America, say, then the source would have
a dollar sign and there'd be a file to translate for use in England.

On the other hand, if the program was to use a pound sign in all cases
then I'd do as below.

>
> So a way to represent Unicode within literals is still needed (didn't we
> discuss this a couple of years ago?).
>

Yes. You may remember my preference was for a named character, something like

pound_currency_string = "\PoundSterling/"
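
Python has a comparable facility, shown here only as an existing example of named characters in literals: the \N{...} escape takes the character's Unicode name, so the source text itself stays ASCII.

```python
import unicodedata

pound_currency_string = "\N{POUND SIGN}"        # named character inside a literal

print(pound_currency_string)
print(unicodedata.name(pound_currency_string))  # the name recorded for that character
```

The names come from the Unicode character database rather than being chosen by the programmer, which differs from the "\PoundSterling/" spelling above.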

--
James Harris

Re: Storing strings

<tk63on$2hcgu$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1893&group=comp.lang.misc#1893

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 17:41:59 +0100
Organization: A noiseless patient Spider
Lines: 76
Message-ID: <tk63on$2hcgu$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Nov 2022 16:41:59 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="f46651f95eb9d2b6d1e5a0d9a6da460f";
logging-data="2667038"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+tXzlsd1m5s4xayVJvbdsVWGl/OCsDL1Y="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:C0k661OphNzcebfvOgiKPSF5XSI=
Content-Language: en-GB
In-Reply-To: <tk3je2$1muc$1@gioia.aioe.org>
 by: David Brown - Sat, 5 Nov 2022 16:41 UTC

On 04/11/2022 18:50, Bart wrote:
> On 04/11/2022 13:58, David Brown wrote:
>
>> Neither "byte" nor "character" should have any kind of arithmetic
>> operators - they are not integers.  But you will need cast or
>> conversion operations on them.
>
> Bytes are small integers, typically of u8 type.

That's true in some languages. In other languages they are character
types. (In C, they are both.)

My suggestion is that they should not be considered integers or
characters, but a low-level "raw data" type. A "byte" should represent
a byte of memory, or an octet in a network packet, or a byte of a file.
It doesn't make sense to do any arithmetic on it because it might be
part of some completely different data type.

A given "byte" might be part of the storage of a floating point number,
or the storage of an address or pointer, or anything else.
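A small Python sketch of that point, assuming little-endian IEEE 754 binary64 storage (forced here with the "<d" format). The helper bump_byte is a made-up name; it does exactly the kind of integer arithmetic on one byte that is being argued against:

```python
import struct

# The 8 bytes that store the float 1.5 (little-endian IEEE 754 binary64).
raw = struct.pack("<d", 1.5)

def bump_byte(data: bytes, index: int) -> bytes:
    """Add 1 (mod 256) to one byte, treating it as a small integer."""
    out = bytearray(data)
    out[index] = (out[index] + 1) % 256
    return bytes(out)

# "Incrementing" the last byte changes the exponent field of the float,
# so the stored value is no longer anywhere near 1.5.
corrupted = struct.unpack("<d", bump_byte(raw, 7))[0]
```

Here the byte at index 7 holds the sign bit and part of the exponent, so adding 1 to it multiplies the stored value by 2**16 rather than changing it by one of anything.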

I've said this before - the strength of a programming language is mainly
determined by what you /cannot/ do, rather than what you /can/ do.
Design a language to make it as hard as possible to get things wrong,
while still being easy to do the things you want it to do. Keeping
integer types, character types, strings, and raw data independent can
help with that.
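One way such a separation could look, sketched in Python. The Byte class and its to_int method are hypothetical names invented for this illustration, not part of any existing library or of any language discussed in the thread:

```python
class Byte:
    """Hypothetical opaque raw-data byte: no arithmetic, only explicit conversion."""

    __slots__ = ("_value",)

    def __init__(self, value: int) -> None:
        if not 0 <= value <= 255:
            raise ValueError("a byte holds a value in the range 0..255")
        self._value = value

    def to_int(self) -> int:
        """Explicit conversion to an integer; arithmetic happens on the result."""
        return self._value

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Byte):
            return NotImplemented
        return self._value == other._value

    def __hash__(self) -> int:
        return hash(self._value)
```

Because the class defines no arithmetic operators, Byte(65) + 1 raises TypeError, while Byte(65).to_int() + 1 is allowed and makes the conversion visible in the source.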

>
> I can't see why arithmetic can't be done with them, unless you want a
> purer kind of language where arithmetic is only allowed on signed
> numbers, and bitwise ops only on unsigned numbers, which is usually
> going to be a pain for all concerned.
>

Whether you have signed and unsigned integer types, what sizes you have,
whether you have an abstract "integer" type and explicitly ranged
subtypes (as in Ada) is another matter.

>> The concept of "signed char" and "unsigned char" in C is a serious
>> design flaw.  A type designed to hold letters should not have a sign,
>> and should not be used to hold arbitrary raw, low-level data.
>
> Signed and unsigned chars are not so bad;

Yes, they are bad. A "character" is a letter or other visual symbol.
Can you explain the difference between "a positive letter X" and "a
negative letter X"? Of course you can't - it is utter nonsense. The
same goes for adding 6 to the Greek letter µ, or multiplying Ð by Å.
Even operations that might appear sensible in code, such as adding 1 to
a char, don't actually make sense - it is the operation of taking the
next letter in the alphabet that makes sense.
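To illustrate the difference between "add 1 to the code" and "next letter in the alphabet", here is a rough Python sketch. The function next_letter and the constant NORWEGIAN_LOWER are assumptions made up for the example; the alphabet string is the 29-letter Norwegian alphabet in lower case:

```python
import string
from typing import Optional

# Assumed for illustration: the 29-letter Norwegian alphabet in lower case.
NORWEGIAN_LOWER = string.ascii_lowercase + "æøå"

def next_letter(letter: str, alphabet: str = string.ascii_lowercase) -> Optional[str]:
    """Return the letter after `letter` in `alphabet`, or None for the last one."""
    position = alphabet.index(letter)  # raises ValueError if not in the alphabet
    if position + 1 == len(alphabet):
        return None
    return alphabet[position + 1]

# Arithmetic on the character code walks off the end of the letters.
after_z_by_arithmetic = chr(ord("z") + 1)

# The alphabet operation gives an answer that depends on the alphabet.
after_z_in_english = next_letter("z")
after_z_in_norwegian = next_letter("z", NORWEGIAN_LOWER)
```

Adding 1 to the code of "z" gives "{", which is not a letter at all, whereas the alphabet-based operation gives no successor in English and "æ" in Norwegian.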

> presumably C intended these to
> do the job of a 'byte' type for small integers. So it was just a poor
> choice of name. (After all there is no separate type in C for bytes
> holding character data.)
>
> What's bad is that third kind: a 'plain char' type, which is
> incompatible with both signed and unsigned char, even though it
> necessarily needs to be one or the other on a specific platform. It
> occurs in no other language, and causes problems within FFI APIs.
>

Certainly having three different "char" types in C is bad. But the only
sensible choice is to have nothing but a plain "char" and use /integer/
types for numeric data. (Call them "u8" and "i8" if you prefer, rather
than the C names "uint8_t" and "int8_t".)
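As one existing example of roughly that split, Python's ctypes module maps C's three one-byte types to three distinct classes: c_char for char, c_byte for signed char and c_ubyte for unsigned char. Only the last two expose integer values. A short sketch, with variable names chosen for illustration:

```python
import ctypes

# C `char`: the value is a length-1 bytes object, not an integer.
plain = ctypes.c_char(b"A")

# C `signed char` and `unsigned char`: the values are small integers.
signed = ctypes.c_byte(-1)
unsigned = ctypes.c_ubyte(255)

# All three occupy exactly one byte of storage.
sizes = [ctypes.sizeof(t) for t in (ctypes.c_char, ctypes.c_byte, ctypes.c_ubyte)]
```

With this mapping, signed.value + 1 is ordinary integer arithmetic, while plain.value + 1 raises TypeError because a bytes object and an int cannot be added.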

(These days I would consider not having any kind of character type at
all, unless it was a language targeting small embedded systems that need
maximal efficiency - by the time you have something that can hold any
UTF-8 character, you might as well just call it a "string".)
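Python is one language that takes this route: it has no separate character type, and indexing a string yields another string of length 1. The sketch below also shows why a fixed one-byte "char" cannot hold an arbitrary character once the encoding is UTF-8. The sample characters and variable names are chosen only for illustration:

```python
# Characters needing 1, 2, 3 and 4 bytes respectively in UTF-8.
samples = ["A", "£", "€", "😀"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Indexing a string gives back a string, not a distinct character type.
first = "£5"[0]
first_is_string = isinstance(first, str) and len(first) == 1
```

A single character therefore takes anywhere from one to four bytes of storage, which is already the variable-length behaviour of a string.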

