RetroBBS - comp.lang.misc - Re: Storing strings

Re: Storing strings

<tk649l$2hcgu$3@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1894&group=comp.lang.misc#1894

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 17:51:01 +0100
Organization: A noiseless patient Spider
Lines: 53
Message-ID: <tk649l$2hcgu$3@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Nov 2022 16:51:01 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="f46651f95eb9d2b6d1e5a0d9a6da460f";
logging-data="2667038"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX188vaTj8VW1wr3ZCcXPQE4m+SsyzDPfdbs="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:8OdN7tG2zKu+nbYKcI/zSHLi/wI=
In-Reply-To: <tk40ji$21394$1@dont-email.me>
Content-Language: en-GB

by: David Brown - Sat, 5 Nov 2022 16:51 UTC

On 04/11/2022 22:35, James Harris wrote:
> On 04/11/2022 17:50, Bart wrote:
>> On 04/11/2022 13:58, David Brown wrote:
>>
>>> Neither "byte" nor "character" should have any kind of arithmetic
>>> operators - they are not integers. But you will need cast or
>>> conversion operations on them.
>>
>> Bytes are small integers, typically of u8 type.
>>
>> I can't see why arithmetic can't be done with them, unless you want a
>> purer kind of language where arithmetic is only allowed on signed
>> numbers, and bitwise ops only on unsigned numbers, which is usually
>> going to be a pain for all concerned.
>
> I think what David means is that arithmetic operations don't apply to
> characters (even though some languages permit such operations). For
> example, neither
>
> 'a' * 5
>
> nor even
>
> 'R' + 1
>
> have any meaning over the set of characters. Prohibiting arithmetic on
> them could be done but would make classifying and manipulating
> characters difficult unless one had a comprehensive set of library
> functions such as
>
> is_digit(char)
> is_alphanum(locale, char)
> is_lower(locale, char)
> upper(locale, char)
>
> and many more.
>
>

Yes.

You will, of course, need some kind of explicit conversion between
strings/characters and integers or bytes. But if you want operations
like "is_digit" or "upper" for more than plain ASCII, you need a
comprehensive library. (As a fine example, the uppercase of "i" in
English is the letter "I", while in Turkish it is the letter "İ".)
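
As a rough illustration of why case mapping needs locale data, here is a minimal Python sketch. The signature upper(locale, ch) follows the one quoted above; the locale codes and the SPECIAL_UPPER table are assumptions made up for this example, and a real implementation would lean on a library such as ICU rather than a hand-written table.

```python
# Hypothetical per-locale exceptions to the default case mapping.
# Only the Turkish dotted/dotless i pair is listed here.
SPECIAL_UPPER = {
    "tr": {"i": "\u0130", "\u0131": "I"},  # i -> İ, ı -> I
}


def upper(locale: str, ch: str) -> str:
    """Return the uppercase form of one character under the given locale."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    overrides = SPECIAL_UPPER.get(locale, {})
    # Fall back to Python's locale-independent mapping when no override applies.
    return overrides.get(ch, ch.upper())


print(upper("en", "i"))  # I
print(upper("tr", "i"))  # İ
```

The default mapping stays the common path; the locale only contributes the exceptions.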

There are plenty of internationalisation libraries available - you
should be able to find something suitable (perhaps
<https://icu.unicode.org>) and make wrappers and interfaces to your new
language.

Re: Storing strings

<tk64st$2hcgu$4@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1896&group=comp.lang.misc#1896

Path: i2pn2.org!i2pn.org!paganini.bofh.team!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 18:01:17 +0100
Organization: A noiseless patient Spider
Lines: 110
Message-ID: <tk64st$2hcgu$4@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Nov 2022 17:01:17 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="f46651f95eb9d2b6d1e5a0d9a6da460f";
logging-data="2667038"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Ux49lurrlccFtoxgIhEw2OokNGEJ5WeU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:LQQvaQ5GgFEAcKFeTncJPAG5xYo=
Content-Language: en-GB
In-Reply-To: <tk43lf$106j$1@gioia.aioe.org>

by: David Brown - Sat, 5 Nov 2022 17:01 UTC

On 04/11/2022 23:28, Bart wrote:
> On 04/11/2022 21:35, James Harris wrote:
>> On 04/11/2022 17:50, Bart wrote:
>>> On 04/11/2022 13:58, David Brown wrote:
>>>
>>>> Neither "byte" nor "character" should have any kind of arithmetic
>>>> operators - they are not integers. But you will need cast or
>>>> conversion operations on them.
>>>
>>> Bytes are small integers, typically of u8 type.
>>>
>>> I can't see why arithmetic can't be done with them, unless you want a
>>> purer kind of language where arithmetic is only allowed on signed
>>> numbers, and bitwise ops only on unsigned numbers, which is usually
>>> going to be a pain for all concerned.
>>
>> I think what David means is that arithmetic operations don't apply to
>> characters
>
> I was picking on the 'byte' type; it seems extraordinary that you
> shouldn't be allowed to do arithmetic with them. If you can initialise a
> byte value with a number like this:
>
> byte a = 123
>
> then it's a number!
>

If I were making the language, then you could not do such an
initialisation. A "byte" is raw data, not a number.

>> (even though some languages permit such operations). For example, neither
>>
>>    'a' * 5
>>
>> nor even
>>
>>    'R' + 1
>>
>> have any meaning over the set of characters.
>
> I actually had such a restriction for a while: char*5 wasn't allowed,
> but char+1 was. After all why on earth shouldn't you want the next
> character in that alphabet? Why should code like this be made illegal:
>
>     a := a * 10 + (c - '0')

Why not :

    char a;        // Use whatever syntax you prefer
    int i;         // and whatever type names you prefer

    a = digit(i);

The function "digit" might be defined :

    char digit(int i) {
        return char(i + ord('0'));
    }

You want to find the next letter after "x"? "char(ord(x) + 1)". Or
perhaps, like Pascal and Ada, "succ(x)".
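
To make the idea concrete, here is a small Python sketch of a character type that has no arithmetic operators but does have explicit conversions. The names Char, lit, code_of and char_from are hypothetical stand-ins for the "char" and "ord" used above, not anything from an existing language or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """A character that deliberately defines no arithmetic operators."""
    code: int  # Unicode code point


def lit(s: str) -> Char:
    """Build a Char from a one-character Python string (a 'literal')."""
    if len(s) != 1:
        raise ValueError("expected exactly one character")
    return Char(ord(s))


def code_of(c: Char) -> int:
    """Explicit Char -> int conversion (the role of ord above)."""
    return c.code


def char_from(n: int) -> Char:
    """Explicit int -> Char conversion (the role of char above)."""
    if not 0 <= n <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return Char(n)


def digit(i: int) -> Char:
    """Return the character for a single decimal digit value."""
    if not 0 <= i <= 9:
        raise ValueError("digit value out of range")
    return char_from(i + code_of(lit("0")))


def succ(c: Char) -> Char:
    """Return the character with the next code point."""
    return char_from(code_of(c) + 1)


print(digit(7) == lit("7"))        # True
print(succ(lit("x")) == lit("y"))  # True
try:
    lit("a") * 5                   # no operator defined, so this fails
except TypeError:
    print("'a' * 5 is rejected")
```

Everything a programmer needs is still expressible, but each crossing between characters and integers is visible in the code.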

A language has to let you do what you need to do - but you should be
required to write it /clearly/. It should not let you mix apples and
oranges without you saying exactly how you think apples and oranges
should be mixed in the given situation.

>
> Then I realised I shouldn't be telling the programmer what they can and
> can't do with characters, as there might be some perfectly valid
> use-case that I simply hadn't thought of.
>
> Maybe 'a' * 5 yields the value 'aaaaa' or the string "aaaaa", or this is
> some kind of encryption algorithm.

You've hit the nail on the head. What does it mean to write "'a' * 5" ?
To some people it means one thing, to others it means something
different, and to most people in most circumstances it is meaningless.
Meaningless code should be a compile-time error. And you let the
programmer write /explicit/ code to say what he/she intends in other cases.

It is only if something is being used so often that explicit code is a
pain to read and write, that you should do anything implicitly here. So
if you are writing a language designed primarily for string processing,
you might consider having "'a' * 5" defined to be "aaaaa". Otherwise,
let the programmer write "repeat('a', 5)" or "'a'.repeat(5)", or
whatever suits the style of the language.
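
For comparison, Python is one language that does define 'a' * 5 as "aaaaa". The explicit alternative is easy to write; the function name repeat below is taken from the paragraph above and is only a sketch.

```python
def repeat(ch: str, count: int) -> str:
    """Return ch repeated count times, with the intent spelled out."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    if count < 0:
        raise ValueError("count must not be negative")
    return "".join(ch for _ in range(count))


print(repeat("a", 5))  # aaaaa
```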

>
> So now they are treated like integers, other than printing an array of
> char or pointer to char assumes they are strings.
>
>> Prohibiting arithmetic on them could be done but would make
>> classifying and manipulating characters difficult unless one had a
>> comprehensive set of library functions such as
>>
>>    is_digit(char)
>>    is_alphanum(locale, char)
>>    is_lower(locale, char)
>>    upper(locale, char)
>>
>> and many more.
>
> As I said, you and I don't know all the possibilities. Of course there
> would need to be conversions between char and int, but this can become a
> nuisance.
>

Re: Storing strings

<tk65al$2hcgu$5@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1897&group=comp.lang.misc#1897

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 5 Nov 2022 18:08:37 +0100
Organization: A noiseless patient Spider
Lines: 162
Message-ID: <tk65al$2hcgu$5@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org> <tk5ebh$2dc61$2@dont-email.me>
<tk5p07$17hl$1@gioia.aioe.org> <tk61o2$2hp0g$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Nov 2022 17:08:37 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="f46651f95eb9d2b6d1e5a0d9a6da460f";
logging-data="2667038"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/QCo5ZJiGkZfXjxHHX70gxjjOZuPtVWVM="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:K+PYxbwIV3Vr81CLj49SoKtt01k=
In-Reply-To: <tk61o2$2hp0g$1@dont-email.me>
Content-Language: en-GB

by: David Brown - Sat, 5 Nov 2022 17:08 UTC

On 05/11/2022 17:07, James Harris wrote:
> On 05/11/2022 13:38, Bart wrote:
>> On 05/11/2022 10:36, James Harris wrote:
>>> On 04/11/2022 22:28, Bart wrote:
>>
>>>> I actually had such a restriction for a while: char*5 wasn't
>>>> allowed, but char+1 was. After all why on earth shouldn't you want
>>>> the next character in that alphabet?
>>>
>>> That's because 'R' + 1 may not be the next character in all
>>> alphabets. Defining 'next' is more than difficult. It depends on
>>> intended collation order which varies in different parts of the world
>>> and can even change over time as authorities choose different
>>> collation orders. Some plausible meanings of 'R' + 1:
>>>
>>>    'S' (as in ASCII)
>>
>> You said elsewhere that you want to use ASCII within programs. Which,
>> as it happens, corresponds to the first 128 points of Unicode. Here:
>>
>>      char c
>>      for c in 'A'..'Z' do
>>          print c
>>      od
>>
>> this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by
>> adding +1 to 'c'; it doesn't care about collating order!
>
> That piece of code is fine for an English-speaking user but to a
> Spaniard the alphabet has a character missing, and Greeks wouldn't agree
> with it at all.
>
> Where L is the locale why not allow something like
>
> for c in L.alpha_first..L.alpha_last
>     print c
> od
>
> ?
>
> That should work in English or Spanish or Greek etc, shouldn't it?

Yes. Of course, it would not work so well in Chinese (where there is no
concept of an alphabet), and would be a challenge for Arabic (where the
shape of a letter varies enormously depending on where it comes in a
word). But it is definitely a step in the right direction, and it makes
the code independent of the character encoding.
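
A minimal Python sketch of iterating over a locale's alphabet instead of over code points. The ALPHABETS table, the locale codes and the letters function are assumptions for illustration only; real data would come from an i18n library.

```python
# Hypothetical locale table holding each alphabet in its own order.
ALPHABETS = {
    "en": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    # Spanish has an extra letter (U+00D1) between N and O.
    "es": "ABCDEFGHIJKLMN" + "\u00d1" + "OPQRSTUVWXYZ",
    # Greek capitals are U+0391..U+03A9; U+03A2 is not assigned to a letter.
    "el": "".join(chr(c) for c in range(0x0391, 0x03AA) if c != 0x03A2),
}


def letters(locale: str):
    """Yield the upper-case letters of the locale's alphabet in order."""
    yield from ALPHABETS[locale]


for locale in ("en", "es", "el"):
    print(locale, "".join(letters(locale)))
```

The loop never adds 1 to a character, so it does not care where the letters sit in the encoding.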

>
>>
>> (This also illustrates the difference between `byte` and `char`; using
>> a byte type, the output would be '656667...90'.)
>>
>>
>>
>>>    'r' (what the user may want as sort order)
>>>    's' (what the user may want as sort order)
>>>    a non-character (as in EBCDIC)
>>
>> My feeling is that it is these diverse requirements that require
>> user-supplied functions.
>
> Functions, yes, (or what appear to be functions) though surely they
> should be part of a library that comes with the language.
>

How much of a library you provide depends on the goals of the language.
You don't have to provide /everything/ !

>>
>> My 'char' is still a thinly veiled numeric type, so ordinary integer
>> arithmetic can be used. Otherwise even something like this becomes
>> impossible:
>>
>>      ['A'..'Z']int histogram
>>
>>      midpoint := (histogram.upb - histogram.lwb)/2
>>
>>      ++histogram[midpoint+1]
>>
>> This requires certain properties of array indices, like being able
>> to do arithmetic, as well as being consecutive ordinal values.
>
> Why not
>
> [L.alpha_first..L.alpha_last] int histogram
>
> ?
>
> As for the calculations what about using L.ord and L.chr to convert
> between chars and integers?
>
>>
>>> Perhaps a pseudo-call would be better such as
>>>
>>>    char_plus(collation, 'R', 1)
>>>
>>> where 'collation' would be used to determine what was the specified
>>> number of characters away from 'R'.
>>
>> Sure, as I said, you can provide any interpretation you like. But if
>> you do C+1, you expect to get the code of the next character (or next
>> codepoint if venturing outside ASCII).
>
> If you use codepoints then you might not get the next character in
> sequence - as in the case of 'R' in EBCDIC (you'd get a non-printing
> character) or 'N' in Spanish (you'd get 'O' rather than the N with a hat
> that a Spaniard would expect).
>
> If the programmer wants "the next character in the alphabet" then
> shouldn't the programming language or a standard library help him get
> that irrespective of the human language the program is meant to be
> processing?
>
>>
>>
>>>
>>> Aside from converting digits (and any other characters used as digits
>>> in a higher number base) is there any meaning to converting chars
>>> to/from ints?
>>
>>
>> My static language makes byte and char slightly different types.
>> (Types involving char may get printed differently.)
>>
>> That meant that `ref byte` and `ref char` were incompatible, which
>> rapidly turned into a nightmare: I might have a readfile() routine
>> that returned a `ref byte` type, a pointer to a block of memory.
>>
>> But I wanted to interpret that block as `ref char` - a string. So this
>> meant loads of casts to either `ref byte` or `ref char` to get things
>> to work, but it got too much (a bit like 'const poisoning' in C where
>> it just propagates everywhere). That was clearly the wrong approach.
>
> C's const propagation sounds like Java with its horrible, and sticky,
> exception propagation.
>

Getting "const" right is something to think long and hard about. When
do you mean "constant", when do you mean "read-only", when do you mean
"I promise this data will never change", "I will assume this data will
never change", "I promise /I/ won't change this data via this
reference", "This data will be unchanged logically but may change in
underlying representation, such as using a cache of some sort", etc. ?
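
Two of these meanings can be told apart even in Python, which has no "const" keyword. The sketch below uses the standard library's types.MappingProxyType as a read-only view: it expresses "I won't change this data via this reference" while making no promise that the data never changes. The variable names are made up for the example.

```python
from types import MappingProxyType

settings = {"depth": 3}
view = MappingProxyType(settings)  # read-only view of the same data

try:
    view["depth"] = 4              # writing through the view is refused
except TypeError:
    print("cannot write through the read-only view")

settings["depth"] = 4              # the owner can still change the data
print(view["depth"])               # 4
```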

Constness is a hugely powerful concept, and something you definitely
want in a language. Modern language design fashion is to make things
constant by default and require explicit indication that they can
change. Some programming languages (pure functional programming
languages, for example) have /only/ constant data - there is no such
thing as variables.

>>
>> In the end I relaxed the type rules so that `ref byte` and `ref char`
>> are compatible, and everything is now SO much simpler.
>
> Would there have been any value in defining a layout for the untyped
> area of bytes (or parts thereof)? That's where I think I am headed.
>
>

Re: Storing strings

<tk7u35$31tel$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1906&group=comp.lang.misc#1906

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sun, 6 Nov 2022 09:17:24 +0000
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <tk7u35$31tel$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org> <tk5ebh$2dc61$2@dont-email.me>
<tk5p07$17hl$1@gioia.aioe.org> <tk61o2$2hp0g$1@dont-email.me>
<tk65al$2hcgu$5@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 6 Nov 2022 09:17:25 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="f5185ac52d07f932f74c2636f27743fe";
logging-data="3208661"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX190NMtnFINRooz8AJ6i3clLubnfIlVgUrI="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:BdEG++KK3G25wQc8P4c90s+Ajj0=
In-Reply-To: <tk65al$2hcgu$5@dont-email.me>
Content-Language: en-GB

by: James Harris - Sun, 6 Nov 2022 09:17 UTC

On 05/11/2022 17:08, David Brown wrote:
> On 05/11/2022 17:07, James Harris wrote:
>> On 05/11/2022 13:38, Bart wrote:

...

>>> My feeling is that it is these diverse requirements that require
>>> user-supplied functions.
>>
>> Functions, yes, (or what appear to be functions) though surely they
>> should be part of a library that comes with the language.
>>
>
> How much of a library you provide depends on the goals of the language.
> You don't have to provide /everything/ !

That's good to hear. :)

While I am trying to make it easy to invoke functions written by other
people ISTM that it's also right for the language to have associated
with it a load of standard provisions - i18n support, various data
structures, display support, maths libraries, etc for one simple reason:
code maintenance; it's easier to maintain software which uses library
calls one already knows than to have to learn yet another set of i18n
calls, for example.

...

>> C's const propagation sounds like Java with its horrible, and sticky,
>> exception propagation.
>>
>
> Getting "const" right is something to think long and hard about. When
> do you mean "constant", when do you mean "read-only", when do you mean
> "I promise this data will never change", "I will assume this data will
> never change", "I promise /I/ won't change this data via this
> reference", "This data will be unchanged logically but may change in
> underlying representation, such as using a cache of some sort", etc. ?

That sounds really interesting and I'd like to get into it but this is
not the thread. If you wanted to start a new thread on the topic I would
reply. Suffice to say here that I don't use "const" but do have "ro" and
"rw" as usable in various contexts which effect many of the things you
mention but I don't know if I have covered everything a programmer might
need.

>
> Constness is a hugely powerful concept, and something you definitely
> want in a language. Modern language design fashion is to make things
> constant by default and require explicit indication that they can
> change. Some programming languages (pure functional programming
> languages, for example) have /only/ constant data - there is no such
> thing as variables.

I haven't gone that far but, for example, I have globals as, by default,
read only and while parameters are read-write a function would have to
keep the originals around if there's a chance they would be needed.

As I say, though, such things need a thread of their own so I'll resist
the urge to say more.

--
James Harris

Re: Storing strings

<tk7uns$31tel$3@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1907&group=comp.lang.misc#1907

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sun, 6 Nov 2022 09:28:27 +0000
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <tk7uns$31tel$3@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org> <tk64st$2hcgu$4@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 6 Nov 2022 09:28:28 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="f5185ac52d07f932f74c2636f27743fe";
logging-data="3208661"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/KY49MHUcmg6gk0SOh+PfZgxIZrzVxJiE="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:zyqtd9JZwlY3jtYr8qxc5RuDquw=
In-Reply-To: <tk64st$2hcgu$4@dont-email.me>
Content-Language: en-GB

by: James Harris - Sun, 6 Nov 2022 09:28 UTC

On 05/11/2022 17:01, David Brown wrote:
> On 04/11/2022 23:28, Bart wrote:

...

>> I actually had such a restriction for a while: char*5 wasn't allowed,
>> but char+1 was. After all why on earth shouldn't you want the next
>> character in that alphabet? Why should code like this be made illegal:
>>
>>      a := a * 10 + (c - '0')
>
> Why not :
>
>     char a;        // Use whatever syntax you prefer
>     int i;        // and whatever type names you prefer
>
>     a = digit(i);
>
> The function "digit" might be defined :
>
>     char digit(int i) {
>         return char(i + ord('0'));
>     }

Wouldn't char and ord need a locale?

That may be the wrong term but by locale I mean a bundled set of rules
(including, in this case, what the digits are and how many there are of
them) which apply to the language and region the program is executing for.

Maybe it is the right term. I see on Wikipedia: "In computing, a locale
is a set of parameters that defines the user's language, region and any
special variant preferences that the user wants to see in their user
interface."

https://en.wikipedia.org/wiki/Locale_(computer_software)

>
> You want to find the next letter after "x"? "char(ord(x) + 1)". Or
> perhaps, like Pascal and Ada, "succ(x)".

pred and succ are great, and I was thinking to start a thread on how
they might be used for different data types. But I have to point out
that they are not enough on their own. If a user wanted the character
ten away from the current one then he wouldn't want to code ten succ
operations.

--
James Harris

Re: Storing strings

<tk84mf$32rij$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1908&group=comp.lang.misc#1908

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sun, 6 Nov 2022 12:10:07 +0100
Organization: A noiseless patient Spider
Lines: 77
Message-ID: <tk84mf$32rij$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org> <tk5ebh$2dc61$2@dont-email.me>
<tk5p07$17hl$1@gioia.aioe.org> <tk61o2$2hp0g$1@dont-email.me>
<tk65al$2hcgu$5@dont-email.me> <tk7u35$31tel$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 6 Nov 2022 11:10:07 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="7f2383d589da89ae0a3de4311e480df9";
logging-data="3239507"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19r/by5+uvkxrCBIpw9M05ssVG9wBa5ooA="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:MTnpSuLP9X8Y5ZbXxezCR6vH3S0=
Content-Language: en-GB
In-Reply-To: <tk7u35$31tel$2@dont-email.me>

by: David Brown - Sun, 6 Nov 2022 11:10 UTC

On 06/11/2022 10:17, James Harris wrote:
> On 05/11/2022 17:08, David Brown wrote:
>> On 05/11/2022 17:07, James Harris wrote:
>>> On 05/11/2022 13:38, Bart wrote:
>
> ..
>
>>>> My feeling is that it is these diverse requirements that require
>>>> user-supplied functions.
>>>
>>> Functions, yes, (or what appear to be functions) though surely they
>>> should be part of a library that comes with the language.
>>>
>>
>> How much of a library you provide depends on the goals of the
>> language. You don't have to provide /everything/ !
>
> That's good to hear. :)
>
> While I am trying to make it easy to invoke functions written by other
> people ISTM that it's also right for the language to have associated
> with it a load of standard provisions - i18n support, various data
> structures, display support, maths libraries, etc for one simple reason:
> code maintenance; it's easier to maintain software which uses library
> calls one already knows than to have to learn yet another set of i18n
> calls, for example.

Sure. But pick one i18n library, write the FFI wrapper, and call that
your standard library. Then users don't have to deal with third-party
code and libraries, and you don't have to learn the intricacies of how
to support multiple languages properly (you don't have enough lifetimes
to learn enough to write it yourself). Everyone wins!

>
> ..
>
>>> C's const propagation sounds like Java with its horrible, and sticky,
>>> exception propagation.
>>>
>>
>> Getting "const" right is something to think long and hard about. When
>> do you mean "constant", when do you mean "read-only", when do you mean
>> "I promise this data will never change", "I will assume this data will
>> never change", "I promise /I/ won't change this data via this
>> reference", "This data will be unchanged logically but may change in
>> underlying representation, such as using a cache of some sort", etc. ?
>
> That sounds really interesting and I'd like to get into it but this is
> not the thread. If you wanted to start a new thread on the topic I would
> reply. Suffice to say here that I don't use "const" but do have "ro" and
> "rw" as usable in various contexts which effect many of the things you
> mention but I don't know if I have covered everything a programmer might
> need.
>
>>
>> Constness is a hugely powerful concept, and something you definitely
>> want in a language. Modern language design fashion is to make
>> things constant by default and require explicit indication that they
>> can change. Some programming languages (pure functional programming
>> languages, for example) have /only/ constant data - there is no such
>> thing as variables.
>
> I haven't gone that far but, for example, I have globals as, by default,
> read only and while parameters are read-write a function would have to
> keep the originals around if there's a chance they would be needed.
>
> As I say, though, such things need a thread of their own so I'll resist
> the urge to say more.
>

Fair enough. It's a big topic, and deserves its own thread. All I will
do here is encourage you to think hard about it, learn about it, and
test ideas early on - if you try to add "const" to a language later, it
will inevitably be a mess, complicated and inconsistent.

Re: Storing strings

<tk85fk$32rij$3@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1909&group=comp.lang.misc#1909

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: david.brown@hesbynett.no (David Brown)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sun, 6 Nov 2022 12:23:32 +0100
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <tk85fk$32rij$3@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org> <tk64st$2hcgu$4@dont-email.me>
<tk7uns$31tel$3@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 6 Nov 2022 11:23:32 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="7f2383d589da89ae0a3de4311e480df9";
logging-data="3239507"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Fa+QFMrP6tDtCBy+W0wSAuvCa8TCfJak="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:K2xAMph5Xdav8tRXfx5qD1kKuXI=
Content-Language: en-GB
In-Reply-To: <tk7uns$31tel$3@dont-email.me>

by: David Brown - Sun, 6 Nov 2022 11:23 UTC

On 06/11/2022 10:28, James Harris wrote:
> On 05/11/2022 17:01, David Brown wrote:
>> On 04/11/2022 23:28, Bart wrote:
>
> ..
>
>>> I actually had such a restriction for a while: char*5 wasn't allowed,
>>> but char+1 was. After all why on earth shouldn't you want the next
>>> character in that alphabet? Why should code like this be made illegal:
>>>
>>>      a := a * 10 + (c - '0')
>>
>> Why not :
>>
>>      char a;        // Use whatever syntax you prefer
>>      int i;        // and whatever type names you prefer
>>
>>      a = digit(i);
>>
>> The function "digit" might be defined :
>>
>>      char digit(int i) {
>>          return char(i + ord('0'));
>>      }
>
> Wouldn't char and ord need a locale?

That would depend on what you are trying to do. If you wanted a real
multi-lingual "digit" function, then yes - and you'd return different
Unicode characters for different languages. But it is also important to
have some way of getting to the underlying representation of the
characters. At a minimum, you'll need that to implement the library of
functions for dealing with locales. (Maybe you want to distinguish
between "low-level" or "library implementation" code that is allowed to
do such things, and "user" code that is not - just as some languages
have "safe" and "unsafe" code modes.)
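
A multi-lingual "digit" might look something like the Python sketch below. The DIGIT_ZERO table and the locale codes are assumptions for illustration; which digits a given locale actually prefers is exactly the kind of knowledge that belongs in a locale library. The low-level step, adding to a code point, stays inside the function.

```python
import unicodedata

# Hypothetical table: the code point of digit zero in each locale's digits.
DIGIT_ZERO = {
    "en": 0x0030,  # ASCII digits
    "ar": 0x0660,  # Arabic-Indic digits
    "hi": 0x0966,  # Devanagari digits
}


def digit(locale: str, i: int) -> str:
    """Return the character for digit value i in the locale's digit set."""
    if not 0 <= i <= 9:
        raise ValueError("digit value out of range")
    # This is the "library implementation" level that touches code points.
    return chr(DIGIT_ZERO[locale] + i)


for loc in DIGIT_ZERO:
    d = digit(loc, 7)
    # unicodedata.digit gives the digit value back from the character.
    print(loc, d, unicodedata.digit(d))
```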

>
> That may be the wrong term but by locale I mean a bundled set of rules
> (including, in this case, what the digits are and how many there are of
> them) which apply to the language and region the program is executing for.
>
> Maybe it is the right term. I see on Wikipedia: "In computing, a locale
> is a set of parameters that defines the user's language, region and any
> special variant preferences that the user wants to see in their user
> interface."
>
> https://en.wikipedia.org/wiki/Locale_(computer_software)
>
>>
>> You want to find the next letter after "x"? "char(ord(x) + 1)". Or
>> perhaps, like Pascal and Ada, "succ(x)".
>
> pred and succ are great, and I was thinking to start a thread on how
> they might be used for different data types. But I have to point out
> that they are not enough on their own. If a user wanted the character
> ten away from the current one then he wouldn't want to code ten succ
> operations.
>

You could give "pred" and "succ" an optional step argument.

You will also have to think about what happens if the result doesn't
make sense - if you step beyond the range for the type, do you throw an
error of some sort? Should some kinds of types have wrapping succ/pred
operations?

Re: Storing strings

<tkrl8i$1hee4$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1984&group=comp.lang.misc#1984

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sun, 13 Nov 2022 20:49:22 +0000
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <tkrl8i$1hee4$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
<tjmesm$7h72$5@dont-email.me> <tjour1$hlef$2@dont-email.me>
<tk0ns4$1g6c3$2@dont-email.me> <tk35qj$1que9$1@dont-email.me>
<tk3je2$1muc$1@gioia.aioe.org> <tk40ji$21394$1@dont-email.me>
<tk43lf$106j$1@gioia.aioe.org> <tk5ebh$2dc61$2@dont-email.me>
<tk5p07$17hl$1@gioia.aioe.org> <tk61o2$2hp0g$1@dont-email.me>
<tk65al$2hcgu$5@dont-email.me> <tk7u35$31tel$2@dont-email.me>
<tk84mf$32rij$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 13 Nov 2022 20:49:22 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="45c540c812ba7e7a9f8aaaa580905047";
logging-data="1620420"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18rppO85bIr8KmtO9FnNV7A644Ghffvs7c="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.4.2
Cancel-Lock: sha1:unkDCjfj1tb1mVOuz07FadS50TE=
Content-Language: en-GB
In-Reply-To: <tk84mf$32rij$2@dont-email.me>

by: James Harris - Sun, 13 Nov 2022 20:49 UTC

On 06/11/2022 11:10, David Brown wrote:
> On 06/11/2022 10:17, James Harris wrote:
>> On 05/11/2022 17:08, David Brown wrote:

....

>>> How much of a library you provide depends on the goals of the
>>> language. You don't have to provide /everything/ !
>>
>> That's good to hear. :)
>>
>> While I am trying to make it easy to invoke functions written by other
>> people ISTM that it's also right for the language to have associated
>> with it a load of standard provisions - i18n support, various data
>> structures, display support, maths libraries, etc for one simple
>> reason: code maintenance; it's easier to maintain software which uses
>> library calls one already knows than to have to learn yet another set
>> of i18n calls, for example.
>
> Sure. But pick one i18n library, write the FFI wrapper, and call that
> your standard library. Then users don't have to deal with third-party
> code and libraries, and you don't have to learn the intricacies of how
> to support multiple languages properly (you don't have enough lifetimes
> to learn enough to write it yourself). Everyone wins!

Yes, internationalisation is an enormous area. I don't have anything
like the requisite knowledge of the world's languages and character sets
to do the work. What I /can/ do, AISI, is to establish principles and
restrictions intended to make multilingual programming more natural. For
example,

* To define the standard form of strings to have unadorned characters
stored as separate codes from diacritics (which I gather may be called
combining characters).

* Diacritic codes would all have to be stored in a certain order
relative to each other.

* String encodings would put diacritics before the character to which
they apply. (Though I am not sure what to do about accents which apply to
whole words or groups of characters.)

* The language would not support ordering of characters based on their
internal codes. All ordering would require a locale to indicate which
should come first.

* String ordering would require a 'comparison string' to be created
according to the rules of a selected locale. The internal codes
of the comparison string /would/ be comparable for ordering.

* There would be a standard API which all i18n libraries would have to
support.

etc
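
As a sketch of the first three points, Unicode's NFD normalization form already separates base characters from combining marks and puts the marks in a fixed relative order, though it stores them after the base character rather than before it. The `marks_first` function below is a hypothetical helper, not part of any library, that rearranges NFD output into the marks-before-base layout suggested above.

```python
import unicodedata

def decompose(s: str) -> str:
    """Base characters and combining marks as separate, canonically ordered codes."""
    return unicodedata.normalize("NFD", s)

def marks_first(s: str) -> str:
    """Hypothetical layout: each base character preceded by its combining marks."""
    out, marks, base = [], [], None
    for ch in decompose(s):
        if unicodedata.combining(ch) and base is not None:
            marks.append(ch)          # a mark belonging to the current base
        else:
            if base is not None:
                out.extend(marks)     # flush the previous cluster, marks first
                out.append(base)
            base, marks = ch, []
    if base is not None:
        out.extend(marks)
        out.append(base)
    return "".join(out)

# Acute (above) then dot below, typed in that order: NFD puts the dot below first.
print([hex(ord(c)) for c in decompose("a\u0301\u0323")])
print([hex(ord(c)) for c in marks_first("\u00e9t\u00e9")])
```

Accents that span several characters are not handled here; the sketch assumes every mark attaches to exactly one base character.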

Whether that kind of approach is valid or not, I don't know. It's just
my best guess at what may be required.
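
The 'comparison string' idea resembles what collation libraries call a sort key (Python's `locale.strxfrm` is one real example). Below is a toy Python sketch of the principle. The `ALPHABETS` table, its two locale names and the `comparison_key` function are all invented for illustration and are far simpler than real collation rules.

```python
import unicodedata

# Hypothetical, heavily simplified "locales": each is one alphabet in collation order.
ALPHABETS = {
    "sv": "abcdefghijklmnopqrstuvwxyzåäö",    # Swedish-style: å, ä, ö follow z
    "toy": "aäåbcdefghijklmnoöpqrstuvwxyz",   # made-up rule: accented letters next to the base
}

def comparison_key(s: str, locale_name: str) -> tuple:
    """Build a key whose plain element-wise ordering follows the chosen alphabet."""
    order = {ch: i for i, ch in enumerate(ALPHABETS[locale_name])}
    s = unicodedata.normalize("NFC", s.lower())
    # Characters the alphabet does not list sort after all listed ones.
    return tuple(order.get(ch, len(order) + ord(ch)) for ch in s)

words = ["zebra", "äpple", "apple"]
print(sorted(words, key=lambda w: comparison_key(w, "sv")))
print(sorted(words, key=lambda w: comparison_key(w, "toy")))
```

The strings themselves are never compared by internal code; only the keys are, which is the separation the list above asks for.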

BTW, emoticons are a complete nightmare. There could be any number of
them, they may render differently on different devices and no ordering
of them makes sense.
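
One small illustration of why, as a Python sketch: a single "family" emoji as displayed can be a sequence of several code points joined by zero-width joiners, so even "one character" is hard to pin down. The particular sequence is just an example chosen for this sketch.

```python
import unicodedata

# Man, ZWJ, woman, ZWJ, girl: usually drawn as a single glyph where supported.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(len(family))                  # number of code points
print(len(family.encode("utf-8")))  # number of UTF-8 bytes
for ch in family:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

A device without a glyph for the joined sequence may show the three people separately, which is one source of the rendering differences.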

--
James Harris

Logic is a pretty flower that smells bad.
