Storing strings

From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Storing strings
Date: Mon, 24 Oct 2022 11:31:11 +0100
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <tj5phf$1lggf$2@dont-email.me>

by: James Harris - Mon, 24 Oct 2022 10:31 UTC

Do you guys have any thoughts on the best ways for strings of characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past
(where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

Options 1 and 2 have the advantage that they can be referred to simply
by address. Option 3 needs an additional place in which to store the
(first, past) control block.

Option 1 has the advantage that it's easy for a program to process (by
either pointer or index).

Options 1 and 3 have the advantage that one can refer to the tail of the
string (anything past the first character) without creating a copy,
although option 3 would need a new control block to be created. Option 2
would require a new string to be created.

In fact, option 3 has the advantage that it allows any contiguous
substring - head, mid, or tail - to be referred to without making a copy
of the required part of the string.

Options 2 and 3 make it fast to find the length. They also allow any
value (i.e. including zero) to be part of the string.
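
For reference, the three options above might look roughly like this in C. This is only a sketch: the names (zstr_length, struct lstr, struct span and so on) are made up for illustration and are not from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option 1: zero-terminated. The length is found by scanning. */
size_t zstr_length(const char *s)
{
    return strlen(s);
}

/* Option 2: the length in a machine word, followed by the characters. */
struct lstr {
    size_t length;
    char   chars[];          /* C99 flexible array member */
};

size_t lstr_length(const struct lstr *s)
{
    return s->length;        /* no scan needed */
}

/* Option 3: a (first, past) control block; the characters live elsewhere. */
struct span {
    const char *first;
    const char *past;        /* one position beyond the last character */
};

size_t span_length(struct span s)
{
    return (size_t)(s.past - s.first);
}

/* Tail of a string (everything after the first character), no copy made.
   Option 1 needs only a new address. */
const char *zstr_tail(const char *s)
{
    assert(*s != '\0');
    return s + 1;
}

/* Option 3 needs a new control block, but still no copy of the characters. */
struct span span_tail(struct span s)
{
    assert(s.first < s.past);
    s.first++;
    return s;
}
```

With option 2 there is no equivalent of the tail functions: the length word has to sit immediately before the characters, so a tail would have to be built as a new string.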

So: Which of those should a compiler support? Should it support more
than one form? If so, should the language allow the programmer to
specify which form to use on any particular string?

If that's not complicated enough, the above essentially considers
strings whose contents could be read-only or read-write but their
lengths don't change. If the lengths can change then there are
additional issues of storage management. Eek! ;)

Recommendations welcome!

--
James Harris

Re: Storing strings

From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: 24 Oct 2022 11:44:38 GMT
Organization: Stefan Ram
Lines: 25
Expires: 1 Sep 2023 11:59:58 GMT
Message-ID: <strings-20221024123718@ram.dialup.fu-berlin.de>
References: <tj5phf$1lggf$2@dont-email.me>
X-Copyright: (C) Copyright 2022 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.

by: Stefan Ram - Mon, 24 Oct 2022 11:44 UTC

James Harris <james.harris.1@gmail.com> writes:
>So: Which of those should a compiler support? Should it support more
>than one form? If so, should the language allow the programmer to
>specify which form to use on any particular string?

I think the idea of C is to leave it up to the programmer.
The C string literals and functions are just some kind of
suggestion, and they help to provide basic services, such
as printing some text to the terminal. But otherwise, the
programmer is free to implement his own string type(s) or
use string libraries.

The choice depends on the expected type of use. For example,
some ways to store strings are known as "ropes" (Hans J Boehm,
1994), others are known as "gap buffers". A text editor
might simultaneously use ropes for its text buffers and
C strings for filenames.

The crucial thing for allowing programmers to implement
their own string type is that the language is fast enough
to do this with little overhead compared to an implementation
of strings in the language itself. Implementing custom
string representations in slow languages might not be feasible.
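
To make the "gap buffer" idea concrete, here is a minimal sketch in C of how a programmer might build one on top of a plain character array. The fixed capacity and all names (struct gap_buffer, gap_move, etc.) are assumptions chosen to keep the example short. A real editor would grow the buffer on demand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { GAP_CAPACITY = 64 };   /* fixed size keeps the sketch short */

/* Text lives in buf[0 .. gap_start) and buf[gap_end .. GAP_CAPACITY);
   the unused gap in between sits at the cursor position. */
struct gap_buffer {
    char   buf[GAP_CAPACITY];
    size_t gap_start;
    size_t gap_end;
};

void gap_init(struct gap_buffer *g)
{
    g->gap_start = 0;
    g->gap_end = GAP_CAPACITY;
}

size_t gap_length(const struct gap_buffer *g)
{
    return GAP_CAPACITY - (g->gap_end - g->gap_start);
}

/* Move the cursor (and with it the gap) to text position pos. */
void gap_move(struct gap_buffer *g, size_t pos)
{
    assert(pos <= gap_length(g));
    if (pos < g->gap_start) {
        size_t n = g->gap_start - pos;
        memmove(g->buf + g->gap_end - n, g->buf + pos, n);
        g->gap_start -= n;
        g->gap_end -= n;
    } else if (pos > g->gap_start) {
        size_t n = pos - g->gap_start;
        memmove(g->buf + g->gap_start, g->buf + g->gap_end, n);
        g->gap_start += n;
        g->gap_end += n;
    }
}

/* Insert one character at the cursor; returns 0 when the buffer is full. */
int gap_insert(struct gap_buffer *g, char c)
{
    if (g->gap_start == g->gap_end)
        return 0;
    g->buf[g->gap_start++] = c;
    return 1;
}

/* Copy the text out as a zero-terminated string.
   out must have room for gap_length(g) + 1 bytes. */
void gap_copy_out(const struct gap_buffer *g, char *out)
{
    size_t tail = GAP_CAPACITY - g->gap_end;
    memcpy(out, g->buf, g->gap_start);
    memcpy(out + g->gap_start, g->buf + g->gap_end, tail);
    out[g->gap_start + tail] = '\0';
}
```

Inserting at the cursor costs one store. Only moving the cursor copies characters, and then only as many as lie between the old and new positions.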

Re: Storing strings

From: mailbox@dmitry-kazakov.de (Dmitry A. Kazakov)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Mon, 24 Oct 2022 14:07:32 +0200
Organization: Aioe.org NNTP Server
Message-ID: <tj5v64$od0$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me>

by: Dmitry A. Kazakov - Mon, 24 Oct 2022 12:07 UTC

On 2022-10-24 12:31, James Harris wrote:
> Do you guys have any thoughts on the best ways for strings of characters
> to be stored?
>
> 1. There's the C way, of course, of reserving one value (zero) and using
> it as a terminator.
>
> 2. There's the 'length prefix' option of putting the length of the
> string in a machine word before the characters.
>
> 3. There's the 'double pointer' way of pointing at, say, first and past
> (where 'past' means first plus length such that the second pointer
> points one position beyond the last character).
>
> Any others?

4. String body only. The constraints are known outside.

This is the way string slices and fixed length strings are implemented.
In the latter case the compiler knows the string's bounds (first and last
indices and thus the length). In the former case the compiler passes a
"string dope" along with the naked body. The dope contains the bounds.

This has an effect on pointers. E.g. if you want slices and efficient
raw strings you must distinguish pointers to definite (constrained) vs.
indefinite (unconstrained) objects of same type.

E.g. in Ada you cannot take an indefinite string pointer to a fixed
length string because there are no bounds. If you wanted that feature you
would use a "fat pointer" to carry bounds with it.

This is similar to atomic and volatile objects and pointers to them. The
mechanics are the same. You cannot take a general-purpose pointer to an
atomic object, because the client code would not know that it should
take care upon dereferencing.
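
C has no direct equivalent of Ada's constrained and unconstrained subtypes, but the distinction can be approximated. The sketch below is an illustration only, with hypothetical names (name8, struct dope, struct fat_ptr): a fixed-length string whose bounds live in the type, next to a "fat pointer" that carries its bounds at run time.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed-length string: the bounds are part of the type,
   so only the body is stored. */
typedef char name8[8];

/* The "dope": bounds kept apart from the body. */
struct dope {
    size_t first;            /* index of the first character */
    size_t last;             /* index of the last character  */
};

/* Fat pointer: a plain pointer to the body plus the dope.
   body points at the character whose index is bounds.first. */
struct fat_ptr {
    const char *body;
    struct dope bounds;
};

size_t dope_length(struct dope d)
{
    return d.last - d.first + 1;
}

/* Callee for the constrained case: a thin pointer is enough,
   the length comes from the type. */
size_t count_spaces_fixed(name8 *s)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof *s; i++)
        n += ((*s)[i] == ' ');
    return n;
}

/* Callee for the unconstrained case: the bounds travel with the pointer. */
size_t count_spaces(struct fat_ptr s)
{
    size_t n = 0;
    for (size_t i = 0; i < dope_length(s.bounds); i++)
        n += (s.body[i] == ' ');
    return n;
}

/* A slice is a new dope over the same body; nothing is copied.
   (This sketch does not handle empty slices.) */
struct fat_ptr slice(struct fat_ptr s, size_t first, size_t last)
{
    assert(s.bounds.first <= first && first <= last && last <= s.bounds.last);
    struct fat_ptr r = { s.body + (first - s.bounds.first), { first, last } };
    return r;
}
```

The thin pointer (name8 *) cannot be handed to count_spaces without first building a fat pointer around it, which is roughly the restriction described above.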

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: Storing strings

From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Mon, 24 Oct 2022 15:28:44 +0100
Organization: Aioe.org NNTP Server
Message-ID: <tj67es$1bu4$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me>

by: Bart - Mon, 24 Oct 2022 14:28 UTC

On 24/10/2022 11:31, James Harris wrote:
> Do you guys have any thoughts on the best ways for strings of characters
> to be stored?
>
> 1. There's the C way, of course, of reserving one value (zero) and using
> it as a terminator.
>
> 2. There's the 'length prefix' option of putting the length of the
> string in a machine word before the characters.
>
> 3. There's the 'double pointer' way of pointing at, say, first and past
> (where 'past' means first plus length such that the second pointer
> points one position beyond the last character).
>
> Any others?
>
> Options 1 and 2 have the advantage that they can be referred to simply
> by address. Option 3 needs an additional place in which to store the
> (first, past) control block.
>
> Option 1 has the advantage that it's easy for a program to process (by
> either pointer or index).
>
> Options 1 and 3 have the advantage that one can refer to the tail of the
> string (anything past the first character) without creating a copy,
> although option 3 would need a new control block to be created. Option 2
> would require a new string to be created.
>
> In fact, option 3 has the advantage that it allows any contiguous
> substring - head, mid, or tail - to be referred to without making a copy
> of the required part of the string.
>
> Options 2 and 3 make it fast to find the length. They also allow any
> value (i.e. including zero) to be part of the string.
>
> So: Which of those should a compiler support? Should it support more
> than one form? If so, should the language allow the programmer to
> specify which form to use on any particular string?
>
> If that's not complicated enough, the above essentially considers
> strings whose contents could be read-only or read-write but their
> lengths don't change. If the lengths can change then there are
> additional issues of storage management. Eek! ;)

For lower level strings, I'd highly recommend using zero-terminated
strings, or using them as the basis, or at least having it as an option.

This is not the 'C way', as I'd long used this outside of C and Unix
(e.g. in DEC assembly, and in my own stuff) for at least a decade before I
first dealt with C.

I still use them. Among many advantages, such as pure simplicity, they
allow you to directly make use of innumerable APIs that specify such
strings.

They can be used in contexts such as the compact string fields of
structs, since the only overhead is allowing space for that terminator **.

The next step up, in lower level code, is to use a slice. This is a
(pointer, length) descriptor. Here no terminator is necessary, which
allows strings to also contain embedded zeros (so they can contain any
binary data).

String slices can point into another string (allowing sharing), or into
another slice, or into a regular zero-terminated string.

However, to call an API function expecting a zero-terminated string
('stringz' as I sometimes call it), the pointer is not enough: you need
to ensure there's a zero following those <length> characters!
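
One way to bridge that gap, sketched in C under the assumption that a slice is a plain (pointer, length) pair, is to copy the slice into a temporary buffer with a terminator. The names struct slice and slice_to_stringz are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A (pointer, length) descriptor; the characters may belong to
   another string, so there is no guarantee of a zero after them. */
struct slice {
    const char *ptr;
    size_t      len;
};

/* Returns a freshly allocated zero-terminated copy of the slice,
   or NULL if allocation fails. The caller frees the result. */
char *slice_to_stringz(struct slice s)
{
    assert(s.ptr != NULL || s.len == 0);
    char *z = malloc(s.len + 1);
    if (z == NULL)
        return NULL;
    if (s.len > 0)
        memcpy(z, s.ptr, s.len);
    z[s.len] = '\0';
    return z;
}
```

The copy is unavoidable when the slice points into the middle of a shared string, because writing a zero in place would truncate the parent. If the slice itself contains a zero, the called function will see a shorter string.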

Within my dynamic scripting language, I have a full-on counted string
type, with reference counting to manage sharing and allow automatic
memory management. But with the same headache when calling low-level FFI
functions that expect C-like strings.

But that language at least will cope with it.

(** The scripting language can define structs with fixed types including
fixed-width string fields. Those are defined in two ways:

stringz*8 A
stringc*8 B

Both A and B occupy an 8-byte field. But A can store a string of at most
7 characters, whereas B can store 8.

Yet B also includes the count so no scanning is needed to determine the
string length. The scheme however only works on fields of 2 to 256
characters.)

Re: Storing strings

From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Wed, 26 Oct 2022 20:43:11 +0100
Organization: A noiseless patient Spider
Lines: 69
Message-ID: <tjc2kf$2hget$1@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
In-Reply-To: <tj5v64$od0$1@gioia.aioe.org>

by: James Harris - Wed, 26 Oct 2022 19:43 UTC

On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
> On 2022-10-24 12:31, James Harris wrote:
>> Do you guys have any thoughts on the best ways for strings of
>> characters to be stored?
>>
>> 1. There's the C way, of course, of reserving one value (zero) and
>> using it as a terminator.
>>
>> 2. There's the 'length prefix' option of putting the length of the
>> string in a machine word before the characters.
>>
>> 3. There's the 'double pointer' way of pointing at, say, first and
>> past (where 'past' means first plus length such that the second
>> pointer points one position beyond the last character).
>>
>> Any others?
>
> 4. String body only. The constraints are known outside.
>
> This is the way string slices and fixed length strings are implemented.
> In the latter case the compiler knows the string's bounds (first and last
> indices and thus the length). In the former case the compiler passes a
> "string dope" along with the naked body. The dope contains the bounds.

That doesn't seem meaningfully different from case 3. To be clear, case
3 would be represented by, in addition to the bytes of the string,

  struct
    first: pointer to first byte of string
    past: pointer to byte after last byte of string
    .... other fields ....
  end struct

The string length would be past - first. The bytes of the string would
be those pointed at (which I presume is what you are calling the naked
body).

>
> This has an effect on pointers. E.g. if you want slices and efficient
> raw strings you must distinguish pointers to definite (constrained) vs.
> indefinite (unconstrained) objects of same type.
>
> E.g. in Ada you cannot take an indefinite string pointer to a fixed
> length string because there are no bounds. If you wanted that feature you
> would use a "fat pointer" to carry bounds with it.

Any reason you'd recommend against storing bounds as in the struct, above?

>
> This is similar to atomic and volatile objects and pointers to them. The
> mechanics are the same. You cannot take a general-purpose pointer to an
> atomic object, because the client code would not know that it should
> take care upon dereferencing.
>

I am not sure what that means. I guess the point you are making is that
there are levels of classification which don't affect the data type but
they do affect how it can be accessed - with the language needing to
prevent a reference weakening the storage model. For example, a
read-write reference to a substring should be prevented from being used
to access part of a string which is supposed to be read-only.

--
James Harris

Re: Storing strings

From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Wed, 26 Oct 2022 21:33:59 +0100
Organization: A noiseless patient Spider
Lines: 110
Message-ID: <tjc5jn$2hpui$3@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me> <tj67es$1bu4$1@gioia.aioe.org>
In-Reply-To: <tj67es$1bu4$1@gioia.aioe.org>

by: James Harris - Wed, 26 Oct 2022 20:33 UTC

On 24/10/2022 15:28, Bart wrote:
> On 24/10/2022 11:31, James Harris wrote:

>> Do you guys have any thoughts on the best ways for strings of
>> characters to be stored?

...

> For lower level strings, I'd highly recommend using zero-terminated
> strings, or using them as the basis, or at least having it as an option.

They certainly seem easiest to work with although they do have
limitations such as:

* cannot include a character with the encoding of zero (as you say)
* must be scanned to determine length
* awkward to add to or delete from the end of as they don't carry any
data about whether the memory immediately following is available or not

>
> This is not the 'C way', as I'd long used this outside of C and Unix
> (e.g. in DEC assembly, and in my own stuff) for at least a decade before I
> first dealt with C.

True, though C popularised the scheme. Besides, on the PDP one way of
storing strings was apparently as

(length, address)

That's according to the Commercial Instruction Set (CIS) part of

https://en.wikipedia.org/wiki/PDP-11_architecture

>
> I still use them. Among many advantages, such as pure simplicity, they
> allow you to directly make use of innumerable APIs that specify such
> strings.
>
> They can be used in contexts such as the compact string fields of
> structs, since the only overhead is allowing space for that terminator **.
>

OK.

>
> The next step up, in lower level code, is to use a slice. This is a
> (pointer, length) descriptor. Here no terminator is necessary, which
> allows strings to also contain embedded zeros (so they can contain any
> binary data).
>
> String slices can point into another string (allowing sharing), or into
> another slice, or into a regular zero-terminated string.

That's more universal and therefore perhaps the best to implement if
only one scheme is to be available. Have to say, though, I guess it
would be hard to manage the memory for. Instead of just (first, length)
or (first, past) perhaps one would need something like

  struct
    first: pointer to first element
    past: pointer just past last element
    count: number of slices pointing to this slice/string
    base: the parent string or memory
    flags: various
  end struct

The base field would refer to the string object we were a slice of or,
if we were not a slice but the base string, the memory area in which the
string was stored.

The flags would indicate whether the string/slice could have its
contents changed and whether it could have its length changed, whether
the contents could be moved in memory, etc.
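
As a rough illustration, the control block above could be written in C along the following lines. This is one possible reading, not a settled design: the union for 'base', the STR_IS_SLICE flag used to tell the two cases apart, and the function names are all assumptions added for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum {
    STR_WRITABLE = 1,        /* contents may be changed */
    STR_IS_SLICE = 2         /* base.parent is valid, not base.memory */
};

struct str {
    char    *first;          /* pointer to first element */
    char    *past;           /* pointer just past last element */
    size_t   count;          /* number of slices pointing to this one */
    union {
        struct str *parent;  /* if we are a slice */
        char       *memory;  /* if we are a base string */
    } base;
    unsigned flags;
};

/* Create a base string holding a copy of len bytes of text. */
struct str *str_new(const char *text, size_t len)
{
    struct str *s = malloc(sizeof *s);
    char *mem = malloc(len ? len : 1);
    if (s == NULL || mem == NULL) {
        free(s);
        free(mem);
        return NULL;
    }
    if (len > 0)
        memcpy(mem, text, len);
    s->first = mem;
    s->past = mem + len;
    s->count = 0;
    s->base.memory = mem;
    s->flags = STR_WRITABLE;
    return s;
}

/* Create a slice [from, to) of parent without copying characters. */
struct str *str_slice(struct str *parent, size_t from, size_t to)
{
    assert(from <= to && to <= (size_t)(parent->past - parent->first));
    struct str *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->first = parent->first + from;
    s->past = parent->first + to;
    s->count = 0;
    s->base.parent = parent;
    s->flags = (parent->flags & STR_WRITABLE) | STR_IS_SLICE;
    parent->count++;
    return s;
}

/* Release a string or slice. Returns 1 if it was freed,
   0 if other slices still refer to it. */
int str_release(struct str *s)
{
    if (s->count > 0)
        return 0;
    if (s->flags & STR_IS_SLICE)
        s->base.parent->count--;
    else
        free(s->base.memory);
    free(s);
    return 1;
}
```

The count field is what makes the memory manageable: a base string cannot be freed, and should not be moved or resized, while any slice still points into it.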

>
> However, to call an API function expecting a zero-terminated string
> ('stringz' as I sometimes call it), the pointer is not enough: you need
> to ensure there's a zero following those <length> characters!
>
>
> Within my dynamic scripting language, I have a full-on counted string
> type, with reference counting to manage sharing and allow automatic
> memory management.

What fields did you use to manage such stuff? Am I on the right lines
with the ideas above?

> But with the same headache when calling low-level FFI
> functions that expect C-like strings.

Just a thought: ensure there is always at least one more byte of memory
than the string requires and put a zero byte at the end of the string
before calling any function which expects a C-like string. (User
responsibility to ensure there are no zero bytes embedded in the string.)
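
A minimal C sketch of that idea, assuming a string that owns its storage and tracks a capacity (the names struct buf_str, buf_str_append and buf_str_as_stringz are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct buf_str {
    char  *data;             /* storage of at least capacity bytes */
    size_t len;              /* characters in use */
    size_t capacity;         /* invariant: capacity >= len + 1 */
};

/* Append n bytes while keeping one spare byte for a terminator.
   Returns 0 (and changes nothing) if they do not fit. */
int buf_str_append(struct buf_str *s, const char *src, size_t n)
{
    assert(s->len < s->capacity);
    if (n >= s->capacity - s->len)
        return 0;
    memcpy(s->data + s->len, src, n);
    s->len += n;
    return 1;
}

/* Write the zero into the spare byte just before the call that needs it.
   The result is valid until the string is next modified. */
const char *buf_str_as_stringz(struct buf_str *s)
{
    assert(s->len < s->capacity);
    s->data[s->len] = '\0';
    return s->data;
}
```

This avoids a copy, but it only works for strings that own the byte after their last character. A slice into the middle of another string cannot use it.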

Perhaps one reason that does not always work is that some predefined
data structures include a fixed-length field in which a string can sit
but which has no room for another byte such as a terminating zero. But
for them the string could be copied out.

Having a string defined by a (first, past) pair would perhaps allow
fixed-length fields to be handled as easily as mutable strings.

--
James Harris

Re: Storing strings

From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Wed, 26 Oct 2022 21:52:06 +0100
Organization: A noiseless patient Spider
Lines: 69
Message-ID: <tjc6lm$2hpui$4@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<strings-20221024123718@ram.dialup.fu-berlin.de>
In-Reply-To: <strings-20221024123718@ram.dialup.fu-berlin.de>

by: James Harris - Wed, 26 Oct 2022 20:52 UTC

On 24/10/2022 12:44, Stefan Ram wrote:
> James Harris <james.harris.1@gmail.com> writes:
>> So: Which of those should a compiler support? Should it support more
>> than one form? If so, should the language allow the programmer to
>> specify which form to use on any particular string?
>
> I think the idea of C is to leave it up to the programmer.
> The C string literals and functions are just some kind of
> suggestion, and they help to provide basic services, such
> as printing some text to the terminal. But otherwise, the
> programmer is free to implement his own string type(s) or
> use string libraries.
>
> The choice depends on the expected type of use. For example,
> some ways to store strings are known as "ropes" (Hans J Boehm,
> 1994), others are known as "gap buffers". A text editor
> might simultaneously use ropes for its text buffers and
> C strings for filenames.

Thanks for the references.

>
> The crucial thing for allowing programmers to implement
> their own string type is that the language is fast enough
> to do this with little overhead compared to an implementation
> of strings in the language itself. Implementing custom
> string representations in slow languages might not be feasible.

That's fair. I would like, however, to have an inbuilt string type that
is easy to work with so that there's a pre-made standard and programmers
don't have to come up with their own or to spend time working out what a
previous programmer had created.

I should have suggested a string or slice interface. Here's a first
attempt at the operations a string would be expected to be hit with.

These are part of the mechanics of string handling, relating to the
structure of the string rather than to its contents, so I've not
included anything in this list which looks at the content of the string.
There are loads of content-based operations such as string comparisons,
case conversion, whitespace trimming, etc, which could be built on top
of the basic handling.

Potential operations on string structures:
* allocate a new string
* create a slice (view) of an existing string
* index into a string
* increase the size of a string
* reduce the size of a string
* return the length of the string
* append/delete characters from the end
* insert/delete characters at the beginning
* take a slice of a string
* concatenate strings (including copying)
* pass to and from functions
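
A few of the operations in this list (allocate, length, index, grow/append, shrink) can be sketched in C as below. It is a deliberately small example with hypothetical names (struct dstr, dstr_append, ...); slices, insertion at the beginning and concatenation are left out, and size overflow is not checked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct dstr {
    char  *data;
    size_t len;              /* characters in use */
    size_t cap;              /* bytes allocated */
};

/* Allocate a new (empty) string; storage is obtained on first append. */
struct dstr dstr_new(void)
{
    struct dstr s = { NULL, 0, 0 };
    return s;
}

/* Return the length of the string. */
size_t dstr_length(const struct dstr *s)
{
    return s->len;
}

/* Index into a string. */
char dstr_at(const struct dstr *s, size_t i)
{
    assert(i < s->len);
    return s->data[i];
}

/* Append n characters to the end, growing the storage if needed.
   Returns 0 if memory could not be obtained. */
int dstr_append(struct dstr *s, const char *src, size_t n)
{
    if (n > s->cap - s->len) {
        size_t cap = s->cap ? s->cap : 16;
        while (cap - s->len < n)
            cap *= 2;
        char *p = realloc(s->data, cap);
        if (p == NULL)
            return 0;
        s->data = p;
        s->cap = cap;
    }
    if (n > 0)
        memcpy(s->data + s->len, src, n);
    s->len += n;
    return 1;
}

/* Reduce the size of a string by deleting characters from the end. */
void dstr_truncate(struct dstr *s, size_t new_len)
{
    assert(new_len <= s->len);
    s->len = new_len;
}

void dstr_free(struct dstr *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
    s->cap = 0;
}
```

Note that dstr_append may move the characters in memory, so any slice pointing into the old storage would be left dangling. That is where a reference count or a "may be moved" flag would come in.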

The idea of slices is that they would appear to be strings but could be
created to refer to the same string elements without allocating new
storage for the sliced data.

These are just some ideas on what might be required. To do this
comprehensively seems rather complicated! :(

--
James Harris

Re: Storing strings

From: mailbox@dmitry-kazakov.de (Dmitry A. Kazakov)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Thu, 27 Oct 2022 09:28:51 +0200
Organization: Aioe.org NNTP Server
Message-ID: <tjdbvk$1sps$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
<tjc2kf$2hget$1@dont-email.me>

by: Dmitry A. Kazakov - Thu, 27 Oct 2022 07:28 UTC

On 2022-10-26 21:43, James Harris wrote:
> On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
>> On 2022-10-24 12:31, James Harris wrote:
>>> Do you guys have any thoughts on the best ways for strings of
>>> characters to be stored?
>>>
>>> 1. There's the C way, of course, of reserving one value (zero) and
>>> using it as a terminator.
>>>
>>> 2. There's the 'length prefix' option of putting the length of the
>>> string in a machine word before the characters.
>>>
>>> 3. There's the 'double pointer' way of pointing at, say, first and
>>> past (where 'past' means first plus length such that the second
>>> pointer points one position beyond the last character).
>>>
>>> Any others?
>>
>> 4. String body only. The constraints are known outside.
>>
>> This is the way string slices and fixed length strings are
>> implemented. In the latter case the compiler knows the string's bounds
>> (first and last indices and thus the length). In the former case the
>> compiler passes a "string dope" along with the naked body. The dope
>> contains the bounds.
>
> That doesn't seem meaningfully different from case 3. To be clear, case
> 3 would be represented by, in addition to the bytes of the string,
>
> struct
> first: pointer to first byte of string
> past: pointer to byte after last byte of string
> .... other fields ....
> end struct
>
> The string length would be past - first. The bytes of the string would
> be those pointed at (which I presume is what you are calling the naked
> body).

That is the structure of a string dope, not the string itself, unless
you have the body in other fields, but then why would you need pointers?

To clarify terms: a string representation must include the string body if
we are talking about values of strings. Things like pointers and
vectorized dopes are references to a string, not strings. You can pass a
string by a reference, sure. But the string value is somewhere else.
What you pass is not a string; it is a substitute.

>> This has an effect on pointers. E.g. if you want slices and efficient
>> raw strings you must distinguish pointers to definite (constrained)
>> vs. indefinite (unconstrained) objects of same type.
>>
>> E.g. in Ada you cannot take an indefinite string pointer to a fixed
>> length string because there are no bounds. If you wanted that feature
>> you would use a "fat pointer" to carry bounds with it.
>
> Any reason you'd recommend against storing bounds as in the struct, above?

Start with interoperability of strings and slices of them. The crucial
requirements would be:

A slice can be passed to a subprogram expecting a string without
copying.

Consider efficiency and low-level close to hardware stuff:

Aggregation of strings with known bounds does not require storing them.

E.g. you can have arrays of fixed length strings (like an image buffer).
If a member of a structure is a fixed length string, no bounds are
stored. A pointer to a fixed length string is a plain pointer etc.
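
In C terms, which only approximate the Ada model, this might look like the sketch below. The record layout, the blank padding and the helper name set_fixed are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NAME_LEN = 8, ROWS = 3 };

/* Each field is a string body only: no length word, no terminator.
   The bounds are known from the type. */
struct record {
    char name[NAME_LEN];
    char code[4];
};

/* Store text into a fixed-length field, padding with blanks. */
void set_fixed(char *field, size_t field_len, const char *text)
{
    size_t n = strlen(text);
    assert(n <= field_len);
    memcpy(field, text, n);
    memset(field + n, ' ', field_len - n);
}
```

An array such as struct record table[ROWS] then holds nothing but characters, and a pointer to one field (char (*p)[NAME_LEN] = &table[1].name) is a plain address whose length comes from its type.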

>> This is similar to atomic and volatile objects and pointers to them. The
>> mechanics are the same. You cannot take a general-purpose pointer to an
>> atomic object, because the client code would not know that it should
>> take care upon dereferencing.
>
> I am not sure what that means. I guess the point you are making is that
> there are levels of classification which don't affect the data type but
> they do affect how it can be accessed - with the language needing to
> prevent a reference weakening the storage model. For example, a
> read-write reference to a substring should be prevented from being used
> to access part of a string which is supposed to be read-only.

Yes, it is a type constraint. There are all sorts of constraints one
could put on a type in order to produce a constrained subtype.
Constraining limits operations, e.g. immutability removes mutators. It
also directs certain implementations like using locking instructions or
dropping known bounds.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: Storing strings

<strings-20221027095502@ram.dialup.fu-berlin.de>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1851&group=comp.lang.misc#1851

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: 27 Oct 2022 08:58:50 GMT
Organization: Stefan Ram
Lines: 14
Expires: 1 Sep 2023 11:59:58 GMT
Message-ID: <strings-20221027095502@ram.dialup.fu-berlin.de>
References: <tj5phf$1lggf$2@dont-email.me> <strings-20221024123718@ram.dialup.fu-berlin.de> <tjc6lm$2hpui$4@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de 8n6gDVhILDQbZj9TtcMH1wzQLw9N1z0Q9qJnou8bmvhhdL
X-Copyright: (C) Copyright 2022 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Thu, 27 Oct 2022 08:58 UTC

James Harris <james.harris.1@gmail.com> writes:
>Potential operations on string structures:
>* allocate a new string
>* create a slice (view) of an existing string
>* index into a string

Many of such operations are provided by the standard library
of C++. You could have a look at its implementation. One might
even think of kinda "backporting" it to C. Or use C++.

Suggested Video: "The strange details of std::string at
Facebook" - Nicholas Ormrod (2016)

Re: Storing strings

<Python-20221027111922@ram.dialup.fu-berlin.de>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1852&group=comp.lang.misc#1852

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: 27 Oct 2022 10:19:34 GMT
Organization: Stefan Ram
Lines: 11
Expires: 1 Sep 2023 11:59:58 GMT
Message-ID: <Python-20221027111922@ram.dialup.fu-berlin.de>
References: <tj5phf$1lggf$2@dont-email.me> <strings-20221024123718@ram.dialup.fu-berlin.de> <tjc6lm$2hpui$4@dont-email.me> <strings-20221027095502@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de BdFJCVKGWmlUf6BIPlUZ1wl0cf+lpKsk+wLiE4glZuTaRS
X-Copyright: (C) Copyright 2022 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Thu, 27 Oct 2022 10:19 UTC

ram@zedat.fu-berlin.de (Stefan Ram) writes:
>Many of such operations are provided by the standard library
>of C++. You could have a look at its implementation. One might
>even think of kinda "backporting" it to C. Or use C++.

One could also look at the implementation of strings in
Python. Python already is a library that can be used
from C. So, one could use Python in C as a library just
for its data types or just for string handling.

Re: Storing strings

<tjdp6i$2cr$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1853&group=comp.lang.misc#1853

Path: i2pn2.org!i2pn.org!aioe.org!uabYU4OOdxBKlV2hpj27FQ.user.46.165.242.75.POSTED!not-for-mail
From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Thu, 27 Oct 2022 12:14:28 +0100
Organization: Aioe.org NNTP Server
Message-ID: <tjdp6i$2cr$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me> <tj67es$1bu4$1@gioia.aioe.org>
<tjc5jn$2hpui$3@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="2459"; posting-host="uabYU4OOdxBKlV2hpj27FQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.3.3
X-Notice: Filtered by postfilter v. 0.9.2

by: Bart - Thu, 27 Oct 2022 11:14 UTC

On 26/10/2022 21:33, James Harris wrote:
> On 24/10/2022 15:28, Bart wrote:
>> On 24/10/2022 11:31, James Harris wrote:
>
>>> Do you guys have any thoughts on the best ways for strings of
>>> characters to be stored?
>
> ..
>
>> For lower level strings, I'd highly recommend using zero-terminated
>> strings, or using them as the basis, or at least having it as an option.
>
> They certainly seem easiest to work with although they do have
> limitations such as:
>
> * cannot include a character with the encoding of zero (as you say)
> * must be scanned to determine length

I use such strings in my 1Mlps compilers. Note that in that application:

* You don't need to get string lengths that often
* The vast majority of strings are short
* String append (that you mention below) is mainly when speed is not
critical (eg. for diagnostics)

> * awkward to add to or delete from the end of as they don't carry any
> data about whether the memory immediately following is available or not

>> The next step up, in lower level code, is to use a slice. This is a
>> (pointer, length) descriptor. Here no terminator is necessary, and
>> allows strings to also contain embedded zeros (so can contain any
>> binary data).
>>
>> String slices can point into another string (allowing sharing), or
>> into another slice, or into a regular zero-terminated string.
>
> That's more universal and therefore perhaps the best to implement if
> only one scheme is to be available.

Most strings are fixed-length once created; strings that can grow are
rare. You don't need a 'capacity' field for example (like C++'s Vector
type).

But managing memory can still be an issue because you don't know if a
particular slice owns its memory, or points to a string literal, or
points into a shared string, or points to external memory.

So a simple slice suits a lower-level language where you do this manually
(it would be a welcome addition to C for example).

My main language is just like this.
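
For readers who want something concrete, here is a rough C sketch of a (pointer, length) slice. The type and function names (slice, slice_from_cstr, slice_sub, slice_eq) are invented for this example and are not taken from any language mentioned in the thread. Note that the slice carries no ownership information, which is exactly the memory-management issue described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical (pointer, length) slice; it does not own its memory. */
typedef struct {
    const char *ptr;
    size_t      len;
} slice;

/* View a whole zero-terminated string as a slice (one scan for length). */
static slice slice_from_cstr(const char *s)
{
    slice r = { s, strlen(s) };
    return r;
}

/* Sub-slice [start, start + count) of an existing slice; nothing is copied. */
static slice slice_sub(slice s, size_t start, size_t count)
{
    assert(start <= s.len && count <= s.len - start);
    slice r = { s.ptr + start, count };
    return r;
}

/* Compare contents byte by byte; embedded zero bytes are allowed. */
static int slice_eq(slice a, slice b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.ptr, b.ptr, a.len) == 0);
}
```

A slice of a slice is just another (pointer, length) pair into the same bytes, so slice_sub(slice_sub(s, 6, 5), 1, 3) still refers to the original body.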

> Have to say, though, I guess it
> would be hard to manage the memory for. Instead of just (first, length)
> or (first, past) perhaps one would need something like
>
> struct
>     first: pointer to first element
>     past: pointer just past last element
>     count: number of slices pointing to this slice/string
>     base: the parent string or memory
>     flags: various
> end struct
>
> The base field would refer to the string object we were a slice of or,
> if we were not a slice but the base string, the memory area in which the
> string was stored.
>
> The flags would indicate whether the string/slice could have its
> contents changed and whether it could have its length changed, whether
> the contents could be moved in memory, etc.
>
>>
>> However to call an API function expecting a zero-terminated string
>> ('stringz' as I sometimes call it), the pointer is not enough: you
>> need to ensure there's a zero following those <length> characters!

>>
>> Within my dynamic scripting language, I have a full-on counted string
>> type, with reference counting to manage sharing and allow automatic
>> memory management.
>
> What fields did you use to manage such stuff? Am I on the right lines
> with the ideas above?

The structure I use is not lightweight because it is for interpreted
code. The following object descriptor is a 32-byte record, used for all
objects. I've shown only the fields used by string objects:

    record objrec =
        u32         refcount
        byte        mutable      # 1 for mutable strings
        byte        objtype
        u16         dummy

        ichar       strptr       # (ref char)
        u64         length
        union
            u64     alloc64
            object  objptr2      # (ref objptr)
        end
    end

The string data itself is separate, pointed to by 'strptr'. This is nil
when the length is zero (it doesn't point to ""). It is not
zero-terminated (unless an external slice happens to be).

Most strings are mutable, in which case .alloc64 gives the capacity of
the allocation.

An important field is objtype; its values are:

    Normal            Regular string (uses alloc64)
    Slice             Slice into another (uses objptr2)
    Extslice          Strings lie outside the object scheme

For slices, while .strptr refers to the string data in question,
.objptr2 refers to the owner object of that string, which has its own
refcount.

External strings are those that belong to external code (eg. from an FFI
function), or those occurring inside a packed struct field for example.

So .objtype is used when sharing or freeing string data.
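
As an illustration only, here is how a release routine might dispatch on objtype. This is a C guess at the shape of such code, not the interpreter's actual source: the constructors, the live_bodies counter and the field name mutable_flag are assumptions added so the sketch can be exercised, and allocation failure is handled by assert purely to keep it short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum objtype { NORMAL, SLICE, EXTSLICE };

typedef struct obj {               /* 32 bytes on a typical 64-bit target */
    uint32_t refcount;
    uint8_t  mutable_flag;         /* 1 for mutable strings */
    uint8_t  objtype;
    uint16_t dummy;
    char    *strptr;               /* NULL when length is zero */
    uint64_t length;
    union {
        uint64_t    alloc64;       /* NORMAL: capacity of the allocation */
        struct obj *objptr2;       /* SLICE: owner of the string data */
    } u;
} obj;

static int live_bodies = 0;        /* test aid: owned bodies not yet freed */

static obj *obj_new_normal(const char *src, uint64_t len)
{
    obj *o = calloc(1, sizeof *o);
    assert(o != NULL);
    o->refcount = 1;
    o->mutable_flag = 1;
    o->objtype = NORMAL;
    o->length = len;
    if (len != 0) {
        o->strptr = malloc(len);
        assert(o->strptr != NULL);
        memcpy(o->strptr, src, len);
        o->u.alloc64 = len;
        live_bodies++;
    }
    return o;
}

/* A slice shares the owner's bytes and holds one reference to the owner. */
static obj *obj_new_slice(obj *owner, uint64_t start, uint64_t len)
{
    assert(owner->objtype == NORMAL);
    assert(start <= owner->length && len <= owner->length - start);
    obj *o = calloc(1, sizeof *o);
    assert(o != NULL);
    o->refcount = 1;
    o->objtype = SLICE;
    o->strptr = len ? owner->strptr + start : NULL;
    o->length = len;
    o->u.objptr2 = owner;
    owner->refcount++;
    return o;
}

/* Drop one reference; objtype decides what, if anything, gets freed. */
static void obj_release(obj *o)
{
    assert(o->refcount > 0);
    if (--o->refcount != 0)
        return;
    switch (o->objtype) {
    case NORMAL:                   /* owns its body */
        if (o->strptr != NULL) { free(o->strptr); live_bodies--; }
        break;
    case SLICE:                    /* body belongs to the owner object */
        obj_release(o->u.objptr2);
        break;
    case EXTSLICE:                 /* body belongs to external code */
        break;
    }
    free(o);
}
```

With this arrangement the body outlives the original handle for as long as any slice still refers to it.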

As I said, this is for interpreted code which can afford to do this
fiddly checking at runtime, which is not done inline either.

For static languages using inline code, it might need to be more
streamlined.

Note that if you take those 32 bytes, then the middle 16 bytes (.strptr
and .length fields) correspond to a raw Slice as used in my lower level
language.

>> But with the same headache when calling low-level FFI functions that
>> expect C-like strings.
>
> Just a thought: ensure there is always at least one more byte of memory
> than the string requires and put a zero byte at the end of the string
> before calling any function which expects a C-like string. (User
> responsibility to ensure there are no zero bytes embedded in the string.)

I think I tried that once. In general it doesn't work, as you might have
a slice into another string; you can't inject a zero byte into the
middle of that other string!
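
The usual fallback is to copy the slice into a temporary buffer and terminate the copy. A small C sketch of that, with the hypothetical names slice and slice_to_cstr; it also refuses slices with embedded zero bytes, since a C-string API would silently truncate them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *ptr;
    size_t      len;
} slice;

/* Return a freshly allocated zero-terminated copy of the slice.
   Returns NULL if allocation fails or the slice contains a zero byte.
   The parent string is never modified; the caller frees the result. */
static char *slice_to_cstr(slice s)
{
    assert(s.ptr != NULL || s.len == 0);
    if (s.len != 0 && memchr(s.ptr, '\0', s.len) != NULL)
        return NULL;
    char *z = malloc(s.len + 1);
    if (z == NULL)
        return NULL;
    if (s.len != 0)
        memcpy(z, s.ptr, s.len);
    z[s.len] = '\0';
    return z;
}
```

The cost is one allocation and one copy per foreign call, which is why an implementation might special-case slices that already happen to be followed by a zero byte.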

Re: Storing strings

<tjj2eu$3eqiv$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1854&group=comp.lang.misc#1854

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 29 Oct 2022 12:23:10 +0100
Organization: A noiseless patient Spider
Lines: 170
Message-ID: <tjj2eu$3eqiv$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
<tjc2kf$2hget$1@dont-email.me> <tjdbvk$1sps$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 29 Oct 2022 11:23:10 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="d20e7d3c01307898664339a7a7665a4c";
logging-data="3631711"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18mQn4YNR3AihCv6ztDJG20YMbDnusVoaY="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:8Lr3RCs0Ix3XUGpc7FnHtEG1Hm8=
In-Reply-To: <tjdbvk$1sps$1@gioia.aioe.org>
Content-Language: en-GB

by: James Harris - Sat, 29 Oct 2022 11:23 UTC

On 27/10/2022 08:28, Dmitry A. Kazakov wrote:
> On 2022-10-26 21:43, James Harris wrote:
>> On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
>>> On 2022-10-24 12:31, James Harris wrote:
>>>> Do you guys have any thoughts on the best ways for strings of
>>>> characters to be stored?
>>>>
>>>> 1. There's the C way, of course, of reserving one value (zero) and
>>>> using it as a terminator.
>>>>
>>>> 2. There's the 'length prefix' option of putting the length of the
>>>> string in a machine word before the characters.
>>>>
>>>> 3. There's the 'double pointer' way of pointing at, say, first and
>>>> past (where 'past' means first plus length such that the second
>>>> pointer points one position beyond the last character).
>>>>
>>>> Any others?
>>>
>>> 4. String body only. The constraints are known outside.
>>>
>>> This is the way string slices and fixed length strings are
>>> implemented. In the later case the compiler knows the strings bounds
>>> (first and last indices and thus the length). In the former case the
>>> compiler passes a "string dope" along with the naked body. The dope
>>> contains the bounds.
>>
>> That doesn't seem meaningfully different from case 3. To be clear,
>> case 3 would be represented by, in addition to the bytes of the string,
>>
>> struct
>>    first: pointer to first byte of string
>>    past: pointer to byte after last byte of string
>>    .... other fields ....
>> end struct
>>
>> The string length would be past - first. The bytes of the string would
>> be those pointed at (which I presume is what you are calling the naked
>> body).
>
> That is the structure of a string dope, not the string itself, unless
> you have the body in other fields, but then why would you need pointers?

Curious use of terms. I presume that by "dope" you mean a dope vector
which can also be called a control block or a descriptor.

As for this specific case, the same information can be conveyed in
different ways: (start, length), (start, memsize), (first, last),
(first, past). I chose the last of these as it should be slightly faster
than the others and does not run into problems when the elements are
other than single bytes.

To explain, for common operations,

memsize() = past - first
length() = memsize() >> alignbits
forward iteration proceeds while address < past
backward iteration proceeds while address >= first

The only stipulation is that the body must not be allocated at the very
top or bottom of the addressable range.

Using (first, past) should be as simple as that. By contrast, the
similar (first, last) runs into a slight problem when elements are wider
than single bytes: should the last pointer point to the start or the end
of the last item?

The others, which involve memsize or length, make it slightly slower to
judge the limits of iteration in the general case, requiring a
calculation to see if a pointer is outside the limits of the string
being referred to.
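
The operations listed above can be sketched in C roughly as follows, here for 16-bit elements so that alignbits is 1. The names str16, ALIGNBITS, sum_forward and copy_reversed are invented for the example. One caveat that is specific to C: the backward loop decrements before dereferencing and tests p > first, because forming a pointer below first is not allowed there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ALIGNBITS = 1 };            /* log2 of the element size in bytes */

/* Hypothetical (first, past) descriptor over 16-bit elements. */
typedef struct {
    const uint16_t *first;         /* first element */
    const uint16_t *past;          /* one element beyond the last */
} str16;

/* memsize() = past - first, measured in bytes. */
static size_t str16_memsize(str16 s)
{
    assert(s.first <= s.past);
    return (size_t)((const char *)s.past - (const char *)s.first);
}

/* length() = memsize() >> alignbits, measured in elements. */
static size_t str16_length(str16 s)
{
    return str16_memsize(s) >> ALIGNBITS;
}

/* Forward iteration proceeds while address < past. */
static uint32_t sum_forward(str16 s)
{
    uint32_t sum = 0;
    for (const uint16_t *p = s.first; p < s.past; ++p)
        sum += *p;
    return sum;
}

/* Backward iteration: step down from past, stopping at first. */
static void copy_reversed(str16 s, uint16_t *out)
{
    for (const uint16_t *p = s.past; p > s.first; )
        *out++ = *--p;
}
```

Neither loop needs a multiplication or a separate counter; the end test is a single pointer comparison.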

>
> To clarify terms. String representation must include the string body if
> we are talking about values of strings. The things like pointers and
> vectorized dopes are references to a string, not strings. You can pass a
> string by a reference, sure. But the string value is somewhere else.
> What you pass is not a string it is a substitute.

That depends, surely, on how "a string" is defined. If strings are
defined as descriptors starting with the fields first and past then the
bodies of such strings can be elsewhere. (There would be other fields of
a string descriptor to assist with memory management and probably some
flags, though I am open to suggestions as to what those fields should be.)

>
>>> This has an effect on pointers. E.g. if you want slices and efficient
>>> raw strings you must distinguish pointers to definite (constrained)
>>> vs. indefinite (unconstrained) objects of same type.
>>>
>>> E.g. in Ada you cannot take an indefinite string pointer to a fixed
>>> length string because there is no bounds. If you wanted that feature
>>> you would use a "fat pointer" to carry bounds with it.
>>
>> Any reason you'd recommend against storing bounds as in the struct,
>> above?
>
> Start with interoperability of strings and slices of. The crucial
> requirements would be:
>
> A slice can be passed to a subprogram expecting a string without
> copying.

Indeed, that's a major benefit of slices, IMO, being able to pass
something which looks and acts like a string but which doesn't need the
elements of the string to be copied.

That said, a slice would probably have a length which the callee can
determine but which the callee cannot change. I presume that's what
you'd call a constraint.

If a callee wanted to be able to change the length of a string then it
would have to be passed a real string, not a slice.

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor its contents could be altered
by the callee. (Could be a real string or a slice.)

>
> Consider efficiency and low-level close to hardware stuff:
>
> Aggregation of strings with known bounds does not require storing them.
>
> E.g. you can have arrays of fixed length strings (like an image buffer).
> If a member of a structure is a fixed length string, no bounds are
> stored. A pointer to a fixed length string is a plain pointer etc.

You mean the string bounds could be known at compile time, say, rather
than at run time. Good point. Any suggestions on how that should be
implemented?

>
>>> This is similar to atomic, volatile objects and pointers to. The
>>> mechanics is same. You cannot take a general-purpose pointer to an
>>> atomic object, because the client code would not know that it should
>>> take care upon dereferencing.
>>
>> I am not sure what that means. I guess the point you are making is
>> that there are levels of classification which don't affect the data
>> type but they do affect how it can be accessed - with the language
>> needing to prevent a reference weakening the storage model. For
>> example, a read-write reference to a substring should be prevented
>> from being used to access part of a string which is supposed to be
>> read-only.
>
> Yes, it is a type constraint. There are all sorts of constraints one
> could put on a type in order to produce a constrained subtype.
> Constraining limits operations, e.g. immutability removes mutators. It
> also directs certain implementations like using locking instructions or
> dropping known bounds.
>

Was with you all the way until you mentioned dropping known bounds. What
does that mean? How can it be legitimate to drop any bounds?

--
James Harris

Re: Storing strings

<tjj62d$ap3$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1855&group=comp.lang.misc#1855

Path: i2pn2.org!i2pn.org!aioe.org!uabYU4OOdxBKlV2hpj27FQ.user.46.165.242.75.POSTED!not-for-mail
From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 29 Oct 2022 13:24:45 +0100
Organization: Aioe.org NNTP Server
Message-ID: <tjj62d$ap3$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
<tjc2kf$2hget$1@dont-email.me> <tjdbvk$1sps$1@gioia.aioe.org>
<tjj2eu$3eqiv$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="11043"; posting-host="uabYU4OOdxBKlV2hpj27FQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.4.0
X-Notice: Filtered by postfilter v. 0.9.2

by: Bart - Sat, 29 Oct 2022 12:24 UTC

On 29/10/2022 12:23, James Harris wrote:
> On 27/10/2022 08:28, Dmitry A. Kazakov wrote:
>> On 2022-10-26 21:43, James Harris wrote:
>>> On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
>>>> On 2022-10-24 12:31, James Harris wrote:
>>>>> Do you guys have any thoughts on the best ways for strings of
>>>>> characters to be stored?
>>>>>
>>>>> 1. There's the C way, of course, of reserving one value (zero) and
>>>>> using it as a terminator.
>>>>>
>>>>> 2. There's the 'length prefix' option of putting the length of the
>>>>> string in a machine word before the characters.
>>>>>
>>>>> 3. There's the 'double pointer' way of pointing at, say, first and
>>>>> past (where 'past' means first plus length such that the second
>>>>> pointer points one position beyond the last character).
>>>>>
>>>>> Any others?
>>>>
>>>> 4. String body only. The constraints are known outside.
>>>>
>>>> This is the way string slices and fixed length strings are
>>>> implemented. In the later case the compiler knows the strings bounds
>>>> (first and last indices and thus the length). In the former case the
>>>> compiler passes a "string dope" along with the naked body. The dope
>>>> contains the bounds.
>>>
>>> That doesn't seem meaningfully different from case 3. To be clear,
>>> case 3 would be represented by, in addition to the bytes of the string,
>>>
>>> struct
>>>    first: pointer to first byte of string
>>>    past: pointer to byte after last byte of string
>>>    .... other fields ....
>>> end struct
>>>
>>> The string length would be past - first. The bytes of the string
>>> would be those pointed at (which I presume is what you are calling
>>> the naked body).
>>
>> That is the structure of a string dope, not the string itself, unless
>> you have the body in other fields, but then why would you need pointers?
>
> Curious use of terms. I presume that by "dope" you mean a dope vector
> which can also be called a control block or a descriptor.
>
> As for this specific case, the same information can be conveyed in
> different ways: (start, length), (start, memsize), (first, last),
> (first, past). I chose the latter as it should be slightly faster than
> the others and does not run into problems when the elements are other
> than single bytes.
>
> To explain, for common operations,
>
> memsize() = past - first
> length() = memsize() >> alignbits
> forward iteration proceeds while address < past
> backward iteration proceeds while address >= first
>
> The only stipulation is that the body must not be allocated at the very
> top or bottom of the addressable range.
>
> Using (first, past) should be as simple as that. By contrast, the
> similar (first, last) runs into a slight problem when elements are wider
> than single bytes: should the last pointer point to the start or the end
> of the last item?
>
> The others, which involve memsize or length, make it slightly slower to
> judge the limits of iteration in the general case, requiring a
> calculation to see if a pointer is outside the limits of the string
> being referred to.
>
>>
>> To clarify terms. String representation must include the string body
>> if we are talking about values of strings. The things like pointers
>> and vectorized dopes are references to a string, not strings. You can
>> pass a string by a reference, sure. But the string value is somewhere
>> else. What you pass is not a string it is a substitute.
>
> That depends, surely, on how "a string" is defined. If strings are
> defined as descriptors starting with the fields first and past then the
> bodies of such strings can be elsewhere. (There would be other fields of
> a string descriptor to assist with memory management and probably some
> flags, though I am open to suggestions as to what those fields should be.)
>
>>
>>>> This has an effect on pointers. E.g. if you want slices and
>>>> efficient raw strings you must distinguish pointers to definite
>>>> (constrained) vs. indefinite (unconstrained) objects of same type.
>>>>
>>>> E.g. in Ada you cannot take an indefinite string pointer to a fixed
>>>> length string because there is no bounds. If you wanted that feature
>>>> you would use a "fat pointer" to carry bounds with it.
>>>
>>> Any reason you'd recommend against storing bounds as in the struct,
>>> above?
>>
>> Start with interoperability of strings and slices of. The crucial
>> requirements would be:
>>
>>     A slice can be passed to a subprogram expecting a string without
>> copying.
>
> Indeed, that's a major benefit of slices, IMO, being able to pass
> something which looks and acts like a string but which doesn't need the
> elements of the string to be copied.
>
> That said, a slice would probably have a length which the callee can
> determine but which the callee cannot change. I presume that's what
> you'd call a constraint.
>
> If a callee wanted to be able to change the length of a string then it
> would have to be passed a real string, not a slice.
>
> I guess there would be these kinds of string argument:
>
> 1. Read-write string. Anything could be done to the string by the
> callee. (Would have to be a real string.)
>
> 2. Read-write fixed-length string. The string's contents could be
> altered but it could not be made longer or shorter. (Could be a real
> string or a slice.)
>
> 3. Read-only string. Neither its length nor its contents could be altered
> by the callee. (Could be a real string or a slice.)

4. Extensible string. This is not quite the same as your (1) which
requires only a mutable string.

You can mutate a string (alter individual characters) without needing to
know the overall length or its allocated capacity.

(You might further split that into mutable/non-mutable extensible
strings. Usually if growing a string by appending to it, you don't want
to also alter existing parts of the string.)

(You probably need to consider Unicode strings too, especially if
represented as UTF8, as the meaning of 'length' needs pinning down.)
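
To make the 'length' ambiguity concrete: with UTF-8 the byte count and the code-point count differ as soon as there is a non-ASCII character. A short C sketch, assuming well-formed input (it does no validation); utf8_codepoints is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a UTF-8 byte sequence that is assumed to be
   well-formed: every byte that is not a continuation byte (10xxxxxx)
   starts a new code point. */
static size_t utf8_codepoints(const char *s, size_t nbytes)
{
    assert(s != NULL || nbytes == 0);
    size_t n = 0;
    for (size_t i = 0; i < nbytes; ++i)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            ++n;
    return n;
}
```

So a descriptor's length field has to be documented as bytes, code units or code points; user-perceived characters are a further level again and need Unicode segmentation rules that this sketch does not attempt.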

Re: Storing strings

<tjj8op$3fquo$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1856&group=comp.lang.misc#1856

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 29 Oct 2022 14:10:49 +0100
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <tjj8op$3fquo$1@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
<strings-20221024123718@ram.dialup.fu-berlin.de>
<tjc6lm$2hpui$4@dont-email.me>
<strings-20221027095502@ram.dialup.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 29 Oct 2022 13:10:49 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="d20e7d3c01307898664339a7a7665a4c";
logging-data="3664856"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18sViyZb5KWqllJcurZ8rLA5qAVrQgOYT4="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:ZNl6/C0D2xTSehgWfcjX5rMgtOA=
Content-Language: en-GB
In-Reply-To: <strings-20221027095502@ram.dialup.fu-berlin.de>

by: James Harris - Sat, 29 Oct 2022 13:10 UTC

On 27/10/2022 09:58, Stefan Ram wrote:
> James Harris <james.harris.1@gmail.com> writes:
>> Potential operations on string structures:
>> * allocate a new string
>> * create a slice (view) of an existing string
>> * index into a string
>
> Many of such operations are provided by the standard library
> of C++. You could have a look at its implementation. One might
> even think of kinda "backporting" it to C. Or use C++.
>
> Suggested Video: "The strange details of std::string at
> Facebook" - Nicholas Ormrod (2016)

Thanks. I had a look at that video - and a number of others.

--
James Harris

Re: Storing strings

<tjjbnu$3gqcc$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1857&group=comp.lang.misc#1857

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 29 Oct 2022 15:01:34 +0100
Organization: A noiseless patient Spider
Lines: 147
Message-ID: <tjjbnu$3gqcc$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me> <tj67es$1bu4$1@gioia.aioe.org>
<tjc5jn$2hpui$3@dont-email.me> <tjdp6i$2cr$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 29 Oct 2022 14:01:35 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="d20e7d3c01307898664339a7a7665a4c";
logging-data="3697036"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18p5Om5/Pc7nLEhxiJSW5jz6HSaDZfA7Lc="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:q86A3fldyYxnurhaywevU/SZ4v8=
Content-Language: en-GB
In-Reply-To: <tjdp6i$2cr$1@gioia.aioe.org>

by: James Harris - Sat, 29 Oct 2022 14:01 UTC

On 27/10/2022 12:14, Bart wrote:
> On 26/10/2022 21:33, James Harris wrote:
>> On 24/10/2022 15:28, Bart wrote:
>>> On 24/10/2022 11:31, James Harris wrote:
>>
>>>> Do you guys have any thoughts on the best ways for strings of
>>>> characters to be stored?

...

>>> String slices can point into another string (allowing sharing), or
>>> into another slice, or into a regular zero-terminated string.
>>
>> That's more universal and therefore perhaps the best to implement if
>> only one scheme is to be available.
>
> Most strings are fixed-length once created; strings that can grow are
> rare. You don't need a 'capacity' field for example (like C++'s Vector
> type).

Having watched some videos on string storage recently I now think I know
what you mean by the capacity field - basically that a string descriptor
would consist of these fields:

start
length
capacity

so that the string could be extended at the end (up to the capacity).
That may be a bit restrictive. A programmer might want to remove or add
characters at the beginning rather than just at the end, even though
such would be done less often.

So what do you think of having a string descriptor more like

first
past
memfirst
mempast

where memfirst and mempast would define the allocated space in which the
string body would sit.
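
A rough C sketch of that four-pointer descriptor, to show that room at both ends falls out naturally. The names strbuf, headroom, tailroom, append and prepend are illustrative only, and the sketch never reallocates; it simply reports failure when the space runs out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: the live characters sit inside the allocation. */
typedef struct {
    char *first, *past;            /* the string itself: [first, past) */
    char *memfirst, *mempast;      /* the allocation: [memfirst, mempast) */
} strbuf;

static size_t headroom(const strbuf *s)
{
    assert(s->memfirst <= s->first);
    return (size_t)(s->first - s->memfirst);
}

static size_t tailroom(const strbuf *s)
{
    assert(s->past <= s->mempast);
    return (size_t)(s->mempast - s->past);
}

/* Add n bytes at the end; returns 0 if they do not fit. */
static int append(strbuf *s, const char *src, size_t n)
{
    if (n > tailroom(s))
        return 0;
    memcpy(s->past, src, n);
    s->past += n;
    return 1;
}

/* Add n bytes at the beginning without moving the existing characters. */
static int prepend(strbuf *s, const char *src, size_t n)
{
    if (n > headroom(s))
        return 0;
    s->first -= n;
    memcpy(s->first, src, n);
    return 1;
}
```

Removing characters from either end is just a matter of advancing first or retreating past, which a (start, length, capacity) triple cannot express at the front without moving data.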

Or perhaps the descriptor should be unified with other references to
memory - as I understand is true of your example, below.

>
> But managing memory can still be an issue because you don't know if a
> particular slice owns its memory, or points to a string literal, or
> points into a shared string, or points to external memory.

Yes, some flags would be needed. They could be stored in the low-order
bits of memfirst and mempast given that:

a) allocations (hence, memfirst) could be aligned
b) sizes of allocations (hence, mempast) could be rounded up to a
suitable power of 2
c) memfirst and mempast would need to be used far less often than first
and past so there would be no great problem with the cost of masking out
the low-order bits to get the addresses.
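
A small C sketch of stashing flags in the low-order bits of an aligned address. It assumes allocations are aligned to at least 8 bytes (checked by an assert) and that converting a pointer to uintptr_t and back preserves it, which holds on mainstream platforms. The flag names and the tag_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Three spare bits, assuming allocations are at least 8-byte aligned. */
enum { FLAG_MASK = 0x7, FLAG_READONLY = 0x1, FLAG_OWNED = 0x2 };

/* Combine an aligned address and some flag bits into one word. */
static uintptr_t tag_pack(void *addr, unsigned flags)
{
    assert(((uintptr_t)addr & FLAG_MASK) == 0);      /* address is aligned */
    assert((flags & ~(unsigned)FLAG_MASK) == 0);     /* flags fit in 3 bits */
    return (uintptr_t)addr | flags;
}

/* Mask the flags out to recover the address. */
static void *tag_addr(uintptr_t t)
{
    return (void *)(t & ~(uintptr_t)FLAG_MASK);
}

/* Extract just the flag bits. */
static unsigned tag_flags(uintptr_t t)
{
    return (unsigned)(t & FLAG_MASK);
}
```

Every use of the tagged word as an address costs one AND, which matches point (c): acceptable for memfirst and mempast, less so for first and past.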

...

>>> Within my dynamic scripting language, I have a full-on counted string
>>> type, with reference counting to manage sharing and allow automatic
>>> memory management.
>>
>> What fields did you use to manage such stuff? Am I on the right lines
>> with the ideas above?
>
> The structure I use is not lightweight because it is for interpreted
> code. The following object descriptor is a 32-byte record, used for all
> objects. I've shown only the fields used by string objects:
>
>     record objrec =
>         u32         refcount
>         byte        mutable      # 1 for mutable strings
>         byte        objtype
>         u16         dummy
>
>         ichar       strptr       # (ref char)
>         u64         length
>         union
>             u64     alloc64
>             object objptr2      # (ref objptr)
>         end
>     end

That looks very sensible. I have considered having 'sentient references'
which would have a common format for anything which refers to memory
(especially referents of dynamic size) and would include the address of
a vtable for the specific type of sentient reference. The vtable would
hold the addresses of methods which could be applied to the reference
rather than to the referent. IOW the referent and the reference would
each have a type.

ATM I think I'd need to work through a lot more use cases before I would
be ready to settle on the details of that so for now I may just go with
the idea of a string descriptor.

>
> The string data itself is separate, pointed to by 'strptr'. This is nil
> when the length is zero (it doesn't point to ""). It is not
> zero-terminated (unless an external slice happens to be).
>
> Most strings are mutable, then .alloc64 gives the capacity of the
> allocation.
>
> An important field is objtype; its values are:
>
>     Normal            Regular string (uses alloc64)
>     Slice             Slice into another (uses objptr2)
>     Extslice          Strings lie outside the object scheme

OK. I may use something like that or, possibly, some flags.

...

> Note that if you take those 32 bytes, then the middle 16 bytes (.strptr
> and .length fields) correspond to a raw Slice as used in my lower level
> language.

Good point. I'd need slices to have the same format as strings and for
both to have flags. As there's no space for flags in the (first, past)
pair I'd need to add a flags word, making the structure

first
past
misc
memfirst
mempast

where misc would store various pieces of information, not just flag
bits. Slices would have only the first three fields. Strings would have
all five. Flags would indicate whether this was a string or a slice.

For me it's too early to optimise but it's worth noting that even for
64-bit machines the above would occupy only 24 or 40 bytes of a 64-byte
cache line so short string bodies could be stored in the same line,
again with flags indicating that that was so.
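
The arithmetic can be checked with a short Python sketch. It assumes every field, including misc, is a 64-bit word and that chars are 32 bits wide; inline_capacity is a made-up helper name.

```python
import struct

CACHE_LINE = 64   # bytes, as in the text above
CHAR_SIZE = 4     # assumed 32-bit chars

slice_size = struct.calcsize("<3Q")    # first, past, misc
string_size = struct.calcsize("<5Q")   # first, past, misc, memfirst, mempast


def inline_capacity(descriptor_size: int) -> int:
    """Number of whole chars that fit in the remainder of one cache line."""
    return (CACHE_LINE - descriptor_size) // CHAR_SIZE
```

Under those assumptions a five-field string descriptor leaves 24 bytes, which is room for six 32-bit chars in the same line.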

--
James Harris

Re: Storing strings

<tjjcjr$3gqcc$3@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1858&group=comp.lang.misc#1858

Newsgroups: comp.lang.misc

by: James Harris - Sat, 29 Oct 2022 14:16 UTC

On 29/10/2022 13:24, Bart wrote:
> On 29/10/2022 12:23, James Harris wrote:

>> I guess there would be these kinds of string argument:
>>
>> 1. Read-write string. Anything could be done to the string by the
>> callee. (Would have to be a real string.)
>>
>> 2. Read-write fixed-length string. The string's contents could be
>> altered but it could not be made longer or shorter. (Could be a real
>> string or a slice.)
>>
>> 3. Read-only string. Neither its length nor its contents could be
>> altered by the callee. (Could be a real string or a slice.)
>
> 4. Extensible string. This is not quite the same as your (1) which
> requires only a mutable string.

You mean a string which can be made longer but the existing contents
could not be changed? I cannot think of a use case for that.

>
> You can mutate a string (alter individual characters) without needing to
> know the overall length or its allocated capacity.

Wouldn't you need to know how long the string was so that a callee could
make sure it was trying to modify characters within the string rather
than memory locations outside it?

>
> (You might further split that into mutable/non-mutable extensible
> strings. Usually if growing a string by appending to it, you don't want
> to also alter existing parts of the string.)

Mutable and extensible are good descriptions though as above I don't yet
see the value in allowing a string to be extensible but its existing
contents to be immutable.

A slice would be inextensible but could be mutable or immutable, AISI.

>
> (You probably need to consider Unicode strings too, especially if
> represented as UTF8, as the meaning of 'length' needs pinning down.)

I haven't mentioned it but ATM my chars are 32-bit and any 32-bit value
can be stored in them, including zero. It also means there's no way to
reserve a value for EOF so that condition has to be handled a different
way from what C programmers are used to where EOF is a value which is
outside the range permitted for chars. Challenges aplenty!

--
James Harris

Re: Storing strings

<tjjgu5$qnd$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1859&group=comp.lang.misc#1859

Newsgroups: comp.lang.misc

by: Bart - Sat, 29 Oct 2022 15:30 UTC

On 29/10/2022 15:16, James Harris wrote:
> On 29/10/2022 13:24, Bart wrote:
>> On 29/10/2022 12:23, James Harris wrote:
>
>
>>> I guess there would be these kinds of string argument:
>>>
>>> 1. Read-write string. Anything could be done to the string by the
>>> callee. (Would have to be a real string.)
>>>
>>> 2. Read-write fixed-length string. The string's contents could be
>>> altered but it could not be made longer or shorter. (Could be a real
>>> string or a slice.)
>>>
>>> 3. Read-only string. Neither its length nor its contents could be
>>> altered by the callee. (Could be a real string or a slice.)
>>
>> 4. Extensible string. This is not quite the same as your (1) which
>> requires only a mutable string.
>
> You mean a string which can be made longer but the existing contents
> could not be changed? I cannot think of a use case for that.

That's a pattern I used all the time to incrementally build strings, for
example to generate C or ASM source files from a language app.

Or it can be as simple as this:

errormess +:= " on line "+tostr(linenumber)

Once extended, the existing parts of the string are never modified.

Perhaps you can give an example of where mutating the characters of a
string, extensible or otherwise, comes in useful.

(My strings generally are mutable, but it's not a feature I use a great
deal.

For applications like text editors, I use a list of strings, one per
line. And editing within each line creates a new string for each edit.
Efficiency here is not critical, and the needs are diverse, like
deleting within the string, or insertion. It's just easier to construct
a new one.)

>
>>
>> You can mutate a string (alter individual characters) without needing
>> to know the overall length or its allocated capacity.
>
> Wouldn't you need to know how long the string was so that a callee could
> make sure it was trying to modify characters within the string rather
> than memory locations outside it?

My point is that, given only the string pointer and an index or offset
into it, that's all that's needed to modify it. If slices at least are
used, then the callee could do bounds checking /if it wanted/.

(My dynamic language does do runtime checking of bounds but, once an
application has been developed, it is very, very rare that I have a
bounds error come up. In a working, debugged program, it should not be
necessary.)

>>
>> (You might further split that into mutable/non-mutable extensible
>> strings. Usually if growing a string by appending to it, you don't
>> want to also alter existing parts of the string.)
>
> Mutable and extensible are good descriptions though as above I don't yet
> see the value in allowing a string to be extensible but its existing
> contents to be immutable.
>
> A slice would be inextensible but could be mutable or immutable, AISI.
>
>>
>> (You probably need to consider Unicode strings too, especially if
>> represented as UTF8, as the meaning of 'length' needs pinning down.)
>
> I haven't mentioned it but ATM my chars are 32-bit and any 32-bit value
> can be stored in them, including zero. It also means there's no way to
> reserve a value for EOF so that condition has to be handled a different
> way from what C programmers are used to where EOF is a value which is
> outside the range permitted for chars. Challenges aplenty!

But you're not using all 2**32 bit patterns? It could reserve -1 or all
1s for EOF just like C does. Because EOF would generally be used for
character-at-a-time streaming, which is typically 8-bit anyway.

Or have you developed a binary file system which works with 32-bit-wide
'bytes'?

Re: Storing strings

<tjjomr$3jpit$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1860&group=comp.lang.misc#1860

Newsgroups: comp.lang.misc

by: James Harris - Sat, 29 Oct 2022 17:42 UTC

On 29/10/2022 16:30, Bart wrote:
> On 29/10/2022 15:16, James Harris wrote:
>> On 29/10/2022 13:24, Bart wrote:
>>> On 29/10/2022 12:23, James Harris wrote:
>>
>>
>>>> I guess there would be these kinds of string argument:
>>>>
>>>> 1. Read-write string. Anything could be done to the string by the
>>>> callee. (Would have to be a real string.)
>>>>
>>>> 2. Read-write fixed-length string. The string's contents could be
>>>> altered but it could not be made longer or shorter. (Could be a real
>>>> string or a slice.)
>>>>
>>>> 3. Read-only string. Neither its length nor its contents could be
>>>> altered by the callee. (Could be a real string or a slice.)
>>>
>>> 4. Extensible string. This is not quite the same as your (1) which
>>> requires only a mutable string.
>>
>> You mean a string which can be made longer but the existing contents
>> could not be changed? I cannot think of a use case for that.
>
> That's a pattern I used all the time to incrementally build strings, for
> example to generate C or ASM source files from a language app.
>
> Or it can be as simple as this:
>
> errormess +:= " on line "+tostr(linenumber)
>
> Once extended, the existing parts of the string are never modified.

Good examples. The 'extend' permission seems a bit specific although I
accept that the uses you mention are common. I suppose it adds to the
security of the language to be able to designate a string as
extensible/inextensible separately from designating whether its existing
contents can be changed or not.

How would it be used? Thinking about functions which take a string as
input, most strings would be purely inputs. They would therefore be both
read-only and inextensible within the called function. Such arguments
could be strings or slices.

Further, functions which /return/ a string would create the string and
return it whole.

It is only functions which /modify/ a string, i.e. take it as an inout
parameter, where it would matter whether the string was read/write or
extensible. For an inout string what should be the defaults? If we say
an inout string defaults to immutable and inextensible then that would
lead to the following ways to specify a string, s, as a parameter:

f: function(s: inout string char)
f: function(s: inout string char rw)
f: function(s: inout string char ext rw)
f: function(s: inout string char ext)

Note the "ext" and "rw" attributes. The idea is that they would specify
how the string could be modified in the function. Adding rw would allow
the string's existing contents to be taken as read-write rather than
read-only. Adding ext would allow the string to be extended.
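
As a rough run-time analogue of those attributes (a Python sketch only; a compiler would ideally enforce this statically, and the names Perm and StringParam are invented for illustration):

```python
from enum import Flag, auto


class Perm(Flag):
    RW = auto()    # existing contents may be changed
    EXT = auto()   # the string may be extended


class StringParam:
    """Hypothetical view of a string argument that carries its permissions."""

    def __init__(self, data: list, perms: Perm):
        self._data = data
        self._perms = perms

    def __getitem__(self, i):
        return self._data[i]

    def __len__(self):
        return len(self._data)

    def __setitem__(self, i, value):
        if Perm.RW not in self._perms:
            raise PermissionError("contents are read-only")
        self._data[i] = value

    def extend(self, items):
        if Perm.EXT not in self._perms:
            raise PermissionError("string is inextensible")
        self._data.extend(items)
```

The four declarations would then correspond to Perm(0), Perm.RW, Perm.EXT | Perm.RW and Perm.EXT. With Perm.EXT alone, extend() succeeds but item assignment raises.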

That's effectively me thinking out loud and trying out some ideas. How
does it look to you?

What about other permissions such as prepend, split, insert, delete,
etc? Perhaps it's too specific to have too many qualifiers although I
can see value in using such info to help match caller and callee. For
example, given the above one could say that as long as the callee
doesn't specify the string as ext then it could be either a string or a
slice. That is appealing from a security perspective.

That said, can a compiler ensure that a string is not used in a way
which breaks the contract indicated by its keywords? You raise some big
issues!

>
> Perhaps you can give an example of where mutating the characters of a
> string, extensible or otherwise, comes in useful.

I intend a string to be simply an array whose length can be changed. The
idea being that a program could have a string of integers, a string of
floats etc just as easily as having a string of characters. As such,
anything which changes the content of an array should also work on
strings. For example, one might want to sort an array in place. As a
string of characters one might want to convert lower case to upper case,
etc.

>
> (My strings generally are mutable, but it's not a feature I use a great
> deal.
>
> For applications like text editors, I use a list of strings, one per
> line. And editing within each line creates a new string for each edit.
> Efficiency here is not critical, and the needs are diverse, like
> deleting within the string, or insertion. It's just easier to construct
> a new one.)

OK.

...

>>>
>>> (You might further split that into mutable/non-mutable extensible
>>> strings. Usually if growing a string by appending to it, you don't
>>> want to also alter existing parts of the string.)
>>
>> Mutable and extensible are good descriptions though as above I don't
>> yet see the value in allowing a string to be extensible but its
>> existing contents to be immutable.
>>
>> A slice would be inextensible but could be mutable or immutable, AISI.
>>
>>>
>>> (You probably need to consider Unicode strings too, especially if
>>> represented as UTF8, as the meaning of 'length' needs pinning down.)
>>
>> I haven't mentioned it but ATM my chars are 32-bit and any 32-bit
>> value can be stored in them, including zero. It also means there's no
>> way to reserve a value for EOF so that condition has to be handled a
>> different way from what C programmers are used to where EOF is a value
>> which is outside the range permitted for chars. Challenges aplenty!
>
> But you're not using all 2**32 bit patterns? It could reserve -1 or all
> 1s for EOF just like C does. Because EOF would generally be used for
> character-at-a-time streaming, which is typically 8-bit anyway.

As above, the language is meant to treat strings as arrays. So AISI it
should not ascribe any particular meaning to their contents.

There are other ways. For example, my plan for EOF is twofold:

1. to have it as an attribute of a file object

2. to have an attempt to read at EOF throw a weak exception which would
be a catchable way to end an iteration.
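
Both parts can be sketched in Python, with an ordinary exception standing in for the weak exception. CharReader, EndOfFile and read_all are hypothetical names, not part of any real design.

```python
class EndOfFile(Exception):
    """Stands in for the 'weak exception' raised by a read at EOF."""


class CharReader:
    """Hypothetical reader over a sequence of 32-bit char values."""

    def __init__(self, chars):
        self._chars = list(chars)
        self._pos = 0

    @property
    def eof(self) -> bool:
        # Part 1: EOF is an attribute of the file object.
        return self._pos >= len(self._chars)

    def read(self) -> int:
        # Part 2: reading at EOF raises instead of returning a sentinel value.
        if self.eof:
            raise EndOfFile
        c = self._chars[self._pos]
        self._pos += 1
        return c


def read_all(reader: CharReader) -> list:
    out = []
    try:
        while True:
            out.append(reader.read())
    except EndOfFile:
        pass   # the exception simply ends the iteration
    return out
```

Because no char value is reserved, values such as 0 and 0xFFFFFFFF pass through read() like any other.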

>
> Or have you developed a binary file system which works with 32-bit-wide
> 'bytes'?

No, my system is nothing like that advanced. At present all bytes
(octets) I read from disk are zero extended to 32 bits. And all chars I
write to disk have their top 24 zero bits chopped off. Though please
don't think that's by design. It's only a temporary measure while I get
the compiler up and running properly. (The compiler and the compilable
language are, at present, rather limited.) In the long term IO streams
should be via typed channels where chars of octets (or some other size)
could be handled natively.
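
In Python terms the temporary scheme amounts to something like the following sketch. The function names are made up, and a list of ints stands in for a string of 32-bit chars.

```python
def octets_to_chars(data: bytes) -> list:
    """Zero-extend each octet read from disk to a 32-bit char value."""
    return [b for b in data]   # each value is 0..255, so the top 24 bits are zero


def chars_to_octets(chars) -> bytes:
    """Keep the low 8 bits of each char; the top 24 bits are dropped."""
    return bytes(c & 0xFF for c in chars)
```

The round trip is lossless only while every char fits in 8 bits; a char such as 0x141 would be written as the single octet 0x41.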

--
James Harris

Re: Storing strings

<tjk1ok$71a$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1861&group=comp.lang.misc#1861

Newsgroups: comp.lang.misc

by: Dmitry A. Kazakov - Sat, 29 Oct 2022 20:17 UTC

On 2022-10-29 13:23, James Harris wrote:
> On 27/10/2022 08:28, Dmitry A. Kazakov wrote:
>> On 2022-10-26 21:43, James Harris wrote:
>>> On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
>>>> On 2022-10-24 12:31, James Harris wrote:
>>>>> Do you guys have any thoughts on the best ways for strings of
>>>>> characters to be stored?
>>>>>
>>>>> 1. There's the C way, of course, of reserving one value (zero) and
>>>>> using it as a terminator.
>>>>>
>>>>> 2. There's the 'length prefix' option of putting the length of the
>>>>> string in a machine word before the characters.
>>>>>
>>>>> 3. There's the 'double pointer' way of pointing at, say, first and
>>>>> past (where 'past' means first plus length such that the second
>>>>> pointer points one position beyond the last character).
>>>>>
>>>>> Any others?
>>>>
>>>> 4. String body only. The constraints are known outside.
>>>>
>>>> This is the way string slices and fixed length strings are
>>>> implemented. In the latter case the compiler knows the string's bounds
>>>> (first and last indices and thus the length). In the former case the
>>>> compiler passes a "string dope" along with the naked body. The dope
>>>> contains the bounds.
>>>
>>> That doesn't seem meaningfully different from case 3. To be clear,
>>> case 3 would be represented by, in addition to the bytes of the string,
>>>
>>> struct
>>>    first: pointer to first byte of string
>>>    past: pointer to byte after last byte of string
>>>    .... other fields ....
>>> end struct
>>>
>>> The string length would be past - first. The bytes of the string
>>> would be those pointed at (which I presume is what you are calling
>>> the naked body).
>>
>> That is the structure of a string dope, not the string itself, unless
>> you have the body in other fields, but then why would you need pointers?
>
> Curious use of terms. I presume that by "dope" you mean a dope vector
> which can also be called a control block or a descriptor.
>
> As for this specific case, the same information can be conveyed in
> different ways: (start, length), (start, memsize), (first, last),
> (first, past). I chose the latter as it should be slightly faster than
> the others and does not run into problems when the elements are other
> than single bytes.

Yes. The problem with (first, next) is that next could be inexpressible.
Most difficulties arise with strings/arrays over enumerations and
modular types. (first, last) has no such problem.

Both have issues with empty strings, e.g. an empty string has a multitude
of representations. Compare with the +/-0 problem for non-two's-complement
integers.
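
One way to see both points, as a small Python sketch. It assumes an 8-bit modular index type (values 0..255), and the function names are made up for illustration:

```python
MODULUS = 256   # assumed 8-bit modular index type with values 0..255


def past_of(last: int) -> int:
    """The 'past' index for a string whose last index is 'last'."""
    return (last + 1) % MODULUS


def length_first_past(first: int, past: int) -> int:
    """Length as computed from a (first, past) pair."""
    return (past - first) % MODULUS


def length_first_last(first: int, last: int) -> int:
    """Length as computed from (first, last); valid for non-empty strings only."""
    return (last - first) % MODULUS + 1
```

For a string occupying all 256 positions, past wraps round to 0, so (first, past) = (0, 0) is indistinguishable from an empty string, while (first, last) = (0, 255) is expressible. Any pair with first equal to past, such as (0, 0) or (7, 7), denotes an empty string, which is the multitude of representations.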

> Using (first, past) should be as simple as that. By contrast, the
> similar (first, last) runs into a slight problem when elements are wider
> than single bytes: should the last pointer point to the start or the end
> of the last item?

Just do not use multibyte representations at all. E.g. a UTF-8 string is
represented by an array of *octets*. It has a view of an array of code
points, but that is not the physical representation, only a view.
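
A short Python sketch of the distinction. The stored value is a bytes object (an array of octets) and decoding gives the code-point view; the helper names are assumptions.

```python
def octet_length(octets: bytes) -> int:
    """Length of the physical representation, in octets."""
    return len(octets)


def code_points(octets: bytes) -> list:
    """The code-point view of a UTF-8 octet array."""
    return [ord(ch) for ch in octets.decode("utf-8")]


def code_point_length(octets: bytes) -> int:
    return len(code_points(octets))


stored = "na\u00efve".encode("utf-8")   # U+00EF takes two octets in UTF-8
```

Here octet_length(stored) is 6 while code_point_length(stored) is 5, which is why the meaning of 'length' has to be pinned down.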

>> To clarify terms. String representation must include the string body
>> if we are talking about values of strings. The things like pointers
>> and vectorized dopes are references to a string, not strings. You can
>> pass a string by a reference, sure. But the string value is somewhere
>> else. What you pass is not a string it is a substitute.
>
> That depends, surely, on how "a string" is defined.

def String is a sequence of characters.

There are no other definitions. Little depends on that because there is
no requirement to represent a string this way. You are completely free to
choose any suitable representation.

[...]

Skipped description of a possible representation.

>>>> This has an effect on pointers. E.g. if you want slices and
>>>> efficient raw strings you must distinguish pointers to definite
>>>> (constrained) vs. indefinite (unconstrained) objects of same type.
>>>>
>>>> E.g. in Ada you cannot take an indefinite string pointer to a fixed
>>>> length string because there are no bounds. If you wanted that feature
>>>> you would use a "fat pointer" to carry bounds with it.
>>>
>>> Any reason you'd recommend against storing bounds as in the struct,
>>> above?
>>
>> Start with interoperability of strings and slices of. The crucial
>> requirements would be:
>>
>> A slice can be passed to a subprogram expecting a string without
>> copying.
>
> Indeed, that's a major benefit of slices, IMO, being able to pass
> something which looks and acts like a string but which doesn't need the
> elements of the string to be copied.
>
> That said, a slice would probably have a length which the callee can
> determine but which the callee cannot change. I presume that's what
> you'd call a constraint.

It could be a constraint for fixed length slices.

> If a callee wanted to be able to change the length of a string then it
> would have to be passed a real string, not a slice.

A callee might pass a variable length slice, which, for example, can be
enlarged or shortened. Many languages with dynamically allocated strings
have this. You need to find some balance between flexibility of
pool-allocated strings and efficiency of fixed length ones. If the
language has a developed type system you can have both transparently
interchangeable for the programmer. Note this is the same discussion as with
numbers. Programmers want all of them with an ability to pass one for
another.

> I guess there would be these kinds of string argument:
>
> 1. Read-write string. Anything could be done to the string by the
> callee. (Would have to be a real string.)
>
> 2. Read-write fixed-length string. The string's contents could be
> altered but it could not be made longer or shorter. (Could be a real
> string or a slice.)
>
> 3. Read-only string. Neither its length nor it contents could be altered
> by the callee. (Could be a real string or a slice.)

Think of it in terms of constraints. Immutability is a constraint. Fixed
length is a constraint. Bounded length is a constraint. Non-sliding
lower bound is a constraint. Non-sliding upper bound is a constraint.

This should cover the whole spectrum. You can express all cases in terms of
constraints.

>> Consider efficiency and low-level close to hardware stuff:
>>
>> Aggregation of strings with known bounds does not require storing
>> them.
>>
>> E.g. you can have arrays of fixed length strings (like an image
>> buffer). If a member of a structure is a fixed length string, no
>> bounds are stored. A pointer to a fixed length string is a plain
>> pointer etc.
>
> You mean the string bounds could be known at compile time, say, rather
> than at run time. Good point. Any suggestions on how that should be
> implemented?

As I said, you just have the string body and nothing else in the
representation. Compare it to numbers. You can have indefinite length
integers, but for many reasons programmers stick to constrained variants
like -2**15..2**15-1.

>>>> This is similar to atomic, volatile objects and pointers to. The
>>>> mechanics are the same. You cannot take a general-purpose pointer to an
>>>> atomic object, because the client code would not know that it should
>>>> take care upon dereferencing.
>>>
>>> I am not sure what that means. I guess the point you are making is
>>> that there are levels of classification which don't affect the data
>>> type but they do affect how it can be accessed - with the language
>>> needing to prevent a reference weakening the storage model. For
>>> example, a read-write reference to a substring should be prevented
>>> from being used to access part of a string which is supposed to be
>>> read-only.
>>
>> Yes, it is a type constraint. There are all sorts of constraints one
>> could put on a type in order to produce a constrained subtype.
>> Constraining limits operations, e.g. immutability removes mutators. It
>> also directs certain implementations like using locking instructions
>> or dropping known bounds.
>
> Was with you all the way until you mentioned dropping known bounds. What
> does that mean? How can it be legitimate to drop any bounds?

Re: Storing strings

<tjk4dn$1bjf$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1862&group=comp.lang.misc#1862

Newsgroups: comp.lang.misc

by: Bart - Sat, 29 Oct 2022 21:02 UTC

On 29/10/2022 18:42, James Harris wrote:
> On 29/10/2022 16:30, Bart wrote:

> Further, functions which /return/ a string would create the string and
> return it whole.

Not necessarily. My dynamic language can return a string which is a
slice into another. (Slices are not exposed in this language; they are
in the static one, where slices are distinct types.)

Example:

func trim(s) =
    if s.len<=2 then return "" fi
    return s[2..$-1]
end

This trims the first and last character of a string. But here it returns a
slice into the original string. If I wanted a fresh copy, I'd have to
use copy() inside the function, or copy() (or a special kind of
assignment) outside it.
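
A rough Python analogue, for illustration only, uses a memoryview as the slice. It is not how the language above implements slices, and the variable names are made up.

```python
def trim(s: memoryview) -> memoryview:
    """Drop the first and last element; the result shares storage with s."""
    if len(s) <= 2:
        return s[0:0]
    return s[1:-1]


original = bytearray(b"[abc]")
inner = trim(memoryview(original))   # a slice into 'original', nothing copied
fresh = bytes(inner)                 # an explicit, independent copy
original[1] = ord("X")               # visible through the slice, not the copy
```

After the last line the slice reads as b"Xbc" while the copy still holds b"abc".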

> It is only functions which /modify/ a string, i.e. take it as an inout
> parameter, where it would matter whether the string was read/write or
> extensible. For an inout string what should be the defaults? If we say
> an inout string defaults to immutable and inextensible then that would
> lead to the following ways to specify a string, s, as a parameter:
>
>
> f: function(s: inout string char)
> f: function(s: inout string char rw)
> f: function(s: inout string char ext rw)
> f: function(s: inout string char ext)
>
> Note the "ext" and "rw" attributes. The idea is that they would specify
> how the string could be modified in the function. Adding rw would allow
> the string's existing contents to be taken as read-write rather than
> read-only. Adding ext would allow the string to be extended.
>
> That's effectively me thinking out loud and trying out some ideas. How
> does it look to you?

My preference is to keep it simple:

(1) String parameters are immutable

(2) String parameters are mutable (this means changing existing
content but also allows extension, plus deletion etc - the
works)

(3) String parameters are assignable. This means that, in addition to
(2), assigning to the parameter also replaces the caller's version

Python allows only (2), when working with Lists, and only (1) when
working with Strings (Strings are immutable, Lists are mutable)

(3) Requires full reference parameters so won't work in Python.

My scripting language allows (2) and (3) on both lists and strings. (1)
is only possible by a flag within the object that renders it immutable
(for example, passing a literal "ABC").
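
The Python behaviour can be seen in a few lines (a sketch; the function names are made up):

```python
def mutate(items: list) -> None:
    items.append(4)        # case (2): in-place change, seen by the caller


def reassign(items: list) -> None:
    items = [9, 9, 9]      # rebinds the local name only, so case (3) is unavailable


def try_mutate(text: str) -> bool:
    try:
        text[0] = "X"      # case (1): str does not support item assignment
    except TypeError:
        return False
    return True
```

Calling mutate on a list changes the caller's list, reassign leaves it untouched, and try_mutate returns False for any str.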

When I mentioned having extensibility as a different capability than
mutation, it is because this could be done via different string types.

There is in-place modification which changes the length of the object,
and modification where the length is not changed; I think these could be
useful, distinct attributes.

Changing the length requires a reference to the /original/ descriptor
where all the info is stored (heap pointer, length, capacity).

But changing the contents without affecting the size either only needs
the heap pointer, or can be done with a /copy/ of the descriptor; it
does not need the original.

(My first implementation of a string type, on a 16- then 32-bit machine,
used a 16-byte descriptor passed by value. The string was mutable, but
it was not possible to extend it without a proper reference.)
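
The difference can be sketched in Python, with a shallow copy of a small record standing in for a descriptor passed by value. The Descriptor fields and the helper names are assumptions for illustration.

```python
import copy
from dataclasses import dataclass


@dataclass
class Descriptor:
    heap: bytearray   # stands in for the heap pointer; copies share it
    length: int
    capacity: int


def set_char(d: Descriptor, i: int, value: int) -> None:
    """Length-preserving change: works through any copy of the descriptor."""
    d.heap[i] = value


def append_char(d: Descriptor, value: int) -> None:
    """Length-changing operation: updates only the descriptor it was given."""
    if d.length == d.capacity:
        raise MemoryError("no spare capacity in this sketch")
    d.heap[d.length] = value
    d.length += 1


orig = Descriptor(bytearray(b"abc\0"), 3, 4)
byval = copy.copy(orig)            # 'passed by value': same heap, separate fields
set_char(byval, 0, ord("X"))       # the original sees this change
append_char(byval, ord("d"))       # the original's length stays at 3
```

After this, orig still reports a length of 3 even though the shared heap now holds four characters, which is why extension needs a reference to the original descriptor.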

> What about other permissions such as prepend, split, insert, delete,
> etc?

These all count as in-place modifications (except split), but as I said
above, it might be useful to treat length-modifying ones differently.

It's not clear how 'split' works, but there are anyway all sorts of
string ops that are not 'in-place'; they simply create new strings.
Presumably 'split' creates 2 or more new strings.

> I intend a string to be simply an array whose length can be changed.

I treat a string as one composite object normally treated as a single
value (like a record). I treat an array or list as a collection of distinct
objects. But this is a minor point (it affects how [] indexing works).

Re: Storing strings

<tjk5oi$1s3k$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1863&group=comp.lang.misc#1863

From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sat, 29 Oct 2022 22:25:39 +0100
Organization: Aioe.org NNTP Server
Message-ID: <tjk5oi$1s3k$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me> <tj67es$1bu4$1@gioia.aioe.org>
<tjc5jn$2hpui$3@dont-email.me> <tjdp6i$2cr$1@gioia.aioe.org>
<tjjbnu$3gqcc$2@dont-email.me>

by: Bart - Sat, 29 Oct 2022 21:25 UTC

On 29/10/2022 15:01, James Harris wrote:
> On 27/10/2022 12:14, Bart wrote:

>> Most strings are fixed-length once created; strings that can grow are
>> rare. You don't need a 'capacity' field for example (like C++'s Vector
>> type).
>
> Having watched some videos on string storage recently I now think I know
> what you mean by the capacity field - basically that a string descriptor
> would consist of these fields:
>
> start
> length
> capacity
>
> so that the string could be extended at the end (up to the capacity).
> That may be a bit restrictive. A programmer might want to remove or add
> characters at the beginning rather than just at the end, even though
> such would be done less often.

Doing a prepend is not a problem. What's critical is whether the new
length is still within the current allocation. (Prepend requires
shifting of the old string so is less efficient anyway.)

If a new allocation is needed, you may be copying data for both prepend
and append.

With delete however, you may need to think about whether to /reduce/ the
allocation size.
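
A small Python sketch of that point, under the assumption of a simple doubling growth policy (the `GrowString` class is hypothetical). Prepend and append both work in place while the new length fits the allocation; only crossing the capacity forces a copy into a new block:

```python
class GrowString:
    def __init__(self, data: bytes, capacity: int):
        if capacity < len(data):
            raise ValueError("capacity smaller than initial contents")
        self.buf = bytearray(capacity)
        self.buf[:len(data)] = data
        self.length = len(data)
        self.reallocations = 0

    def _ensure(self, extra: int) -> None:
        # Reallocate only when the new length exceeds the current block.
        needed = self.length + extra
        if needed > len(self.buf):
            new_buf = bytearray(max(needed, 2 * len(self.buf)))
            new_buf[:self.length] = self.buf[:self.length]
            self.buf = new_buf
            self.reallocations += 1

    def append(self, data: bytes) -> None:
        self._ensure(len(data))
        self.buf[self.length:self.length + len(data)] = data
        self.length += len(data)

    def prepend(self, data: bytes) -> None:
        self._ensure(len(data))
        n = len(data)
        # Shift the old contents up, then write the new prefix.
        self.buf[n:n + self.length] = self.buf[:self.length]
        self.buf[:n] = data
        self.length += n

    def value(self) -> bytes:
        return bytes(self.buf[:self.length])


s = GrowString(b"cd", capacity=4)
s.prepend(b"ab")                       # fits in the existing block
fits = (s.value(), s.reallocations)
s.append(b"e")                         # exceeds capacity, so it reallocates
grown = (s.value(), s.reallocations)
```

A delete operation would be the place to decide whether to shrink the block; this sketch never does.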

> So what do you think of having a string descriptor more like
>
> first
> past
> memfirst
> mempast
>
> where memfirst and mempast would define the allocated space in which the
> string body would sit.

What's the difference between 'first' and 'memfirst'? Would you have a
string that doesn't start at the beginning of its allocated block?

>> An important field is objtype; its values are:
>>
>>      Normal            Regular string (uses alloc64)
>>      Slice             Slice into another (uses objptr2)
>>      Extslice          Strings lie outside the object scheme
>
> OK. I may use something like that or, possibly, some flags.
>
> ..
>
>> Note that if you take those 32 bytes, then the middle 16 bytes
>> (.strptr and .length fields) correspond to a raw Slice as used in my
>> lower level language.
>
> Good point. I'd need slices to have the same format as strings and for
> both to have flags. As there's no space for flags in the (first, past)
> pair I'd need to add a flags word, making the structure
>
> first
> past
> misc
> memfirst
> mempast
>
> where misc would store various pieces of information, not just flag
> bits. Slices would have only the first three fields. Strings would have
> all five. Flags would indicate whether this was a string or a slice.
>
> For me it's too early to optimise but it's worth noting that even for
> 64-bit machines the above would occupy only 24 or 40 bytes of a 64-byte
> cache line so short string bodies could be stored in the same line,
> again with flags indicating that that was so.

I think that if your string implementation requires a 24 or 40-byte
descriptor, then thinking about cache-line optimisation /is/ premature!
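
For reference, the 24 and 40 byte figures are just three or five 64-bit words. A quick Python check, assuming every field (including misc) occupies one unsigned 64-bit word and a 64-byte cache line:

```python
import struct

CACHE_LINE = 64  # bytes, as assumed in the quoted text

# "<" disables native alignment padding; "Q" is an unsigned 64-bit word.
slice_size = struct.calcsize("<3Q")   # first, past, misc
string_size = struct.calcsize("<5Q")  # first, past, misc, memfirst, mempast

spare_for_slice = CACHE_LINE - slice_size
spare_for_string = CACHE_LINE - string_size
```

So a string body of up to 24 bytes could share the line with a five-field descriptor, if the descriptor itself is cache-line aligned.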

I considered such a descriptor too heavyweight for my static language.
(I did incorporate such a string type once, intended for uses where
performance didn't matter: sorting out UI, printing error messages and
diagnostics, that sort of thing.)

In the end I decided it didn't really fit. But then I have two languages.

I guess yours likely sits between my two.

Re: Storing strings

<tjlmth$4kfm$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1864&group=comp.lang.misc#1864

From: james.harris.1@gmail.com (James Harris)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sun, 30 Oct 2022 11:24:33 +0000
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <tjlmth$4kfm$1@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
<tjc2kf$2hget$1@dont-email.me> <tjdbvk$1sps$1@gioia.aioe.org>
<tjj2eu$3eqiv$2@dont-email.me> <tjk1ok$71a$1@gioia.aioe.org>
In-Reply-To: <tjk1ok$71a$1@gioia.aioe.org>

by: James Harris - Sun, 30 Oct 2022 11:24 UTC

On 29/10/2022 21:17, Dmitry A. Kazakov wrote:
> On 2022-10-29 13:23, James Harris wrote:

...

>> As for this specific case, the same information can be conveyed in
>> different ways: (start, length), (start, memsize), (first, last),
>> (first, past). I chose the latter as it should be slightly faster than
>> the others and does not run into problems when the elements are other
>> than single bytes.
>
> Yes. The problem with (first, next) is that next could be inexpressible.
> Most difficulties arise with strings/arrays over enumerations and
> modular types. (first, last) has no such problem.
>
> Both have issues with empty strings, e.g. with a multitude of
> representations of. Compare with +/-0 problem for non-2-complement
> integers.

That sounds interesting. Do you see multiple representations of the
empty string in the following? Monospacing required. Here's how the
string "abcd" would be stored

   !_a_!_b_!_c_!_d_!

     ^               ^
     !               !
   first            past

* so first would point at the first element of the string
* and past would point one cell beyond the last element of the string.

I don't see where you see a multitude of representations of the null
string. AISI the empty string would simply have past equal to first in
all cases.
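
A minimal Python sketch of the (first, past) scheme, using indices into a byte buffer in place of real addresses (the `Span` type and helper names are hypothetical). Length is past minus first, the empty string is first equal to past, and a tail is a new pair over the same storage:

```python
from typing import NamedTuple


class Span(NamedTuple):
    first: int  # index of the first element (stands in for an address)
    past: int   # index one cell beyond the last element


memory = bytearray(b"..abcd..")


def length(s: Span) -> int:
    return s.past - s.first


def is_empty(s: Span) -> bool:
    return s.first == s.past


def contents(s: Span) -> bytes:
    return bytes(memory[s.first:s.past])


def tail(s: Span) -> Span:
    # A new control block, but no copy of the characters.
    return Span(s.first + 1, s.past)


abcd = Span(2, 6)
bcd = tail(abcd)
nothing = Span(6, 6)
```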

...

>> That said, a slice would probably have a length which the callee can
>> determine but which the callee cannot change. I presume that's what
>> you'd call a constraint.
>
> It could be a constraint for fixed length slices.
>
>> If a callee wanted to be able to change the length of a string then it
>> would have to be passed a real string, not a slice.
>
> A callee might pass a variable length slice, which, for example, can be
> enlarged or shortened. Many languages with dynamically allocated strings
> have this.

What is your definition of a slice? Is it /part/ of an underlying string
or is it a /copy/ of part of a string? For example, if

string S = "abcde"
slice T = S[1..3] ;"bcd"

then changes to T would do what to S?

If slice is a view of an underlying string (which is what I had in mind)
then I don't get how you could meaningfully enlarge or shorten it.

...

>> I guess there would be these kinds of string argument:
>>
>> 1. Read-write string. Anything could be done to the string by the
>> callee. (Would have to be a real string.)
>>
>> 2. Read-write fixed-length string. The string's contents could be
>> altered but it could not be made longer or shorter. (Could be a real
>> string or a slice.)
>>
>> 3. Read-only string. Neither its length nor its contents could be
>> altered by the callee. (Could be a real string or a slice.)
>
> Think of it in terms of constraints. Immutability is a constraint. Fixed
> length is a constraint. Bounded length is a constraint. Non-sliding
> lower bound is a constraint. Non-sliding upper bound is a constraint.
>
> This should cover all spectrum. You can express all cases in terms of
> constraints.

I presume such constraints would be specified when objects are declared.
As a programmer how would you want to specify such constraints? Would
each have a reserved word, for example?

--
James Harris

Re: Storing strings

<tjm16h$vrp$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1865&group=comp.lang.misc#1865

From: mailbox@dmitry-kazakov.de (Dmitry A. Kazakov)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sun, 30 Oct 2022 15:20:03 +0100
Organization: Aioe.org NNTP Server
Message-ID: <tjm16h$vrp$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
<tjc2kf$2hget$1@dont-email.me> <tjdbvk$1sps$1@gioia.aioe.org>
<tjj2eu$3eqiv$2@dont-email.me> <tjk1ok$71a$1@gioia.aioe.org>
<tjlmth$4kfm$1@dont-email.me>

by: Dmitry A. Kazakov - Sun, 30 Oct 2022 14:20 UTC

On 2022-10-30 12:24, James Harris wrote:
> On 29/10/2022 21:17, Dmitry A. Kazakov wrote:
>> On 2022-10-29 13:23, James Harris wrote:
>
> ..
>
>>> As for this specific case, the same information can be conveyed in
>>> different ways: (start, length), (start, memsize), (first, last),
>>> (first, past). I chose the latter as it should be slightly faster
>>> than the others and does not run into problems when the elements are
>>> other than single bytes.
>>
>> Yes. The problem with (first, next) is that next could be
>> inexpressible. Most difficulties arise with strings/arrays over
>> enumerations and modular types. (first, last) has no such problem.
>>
>> Both have issues with empty strings, e.g. with a multitude of
>> representations of. Compare with +/-0 problem for non-2-complement
>> integers.
>
> That sounds interesting. Do you see multiple representations of the
> empty string in the following? Monospacing required. Here's how the
> string "abcd" would be stored
>
>   !_a_!_b_!_c_!_d_!
>
>     ^               ^
>     !               !
>   first            past
>
> * so first would point at the first element of the string
> * and past would point one cell beyond the last element of the string.
>
> I don't see where you see a multitude of representations of the null
> string. AISI the empty string would simply have past equal to first in
> all cases.

...
(0..0)
(1..1)
(2..2)
...
(n..n)
...

With pointers it becomes even worse as some of them might point to
invalid addresses.
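
To spell the list out, here is a short Python sketch, again with indices standing in for pointers and a hypothetical `Span` type. Every (n..n) pair is empty, yet the pairs differ field by field, so string equality has to be defined on contents rather than on the descriptor:

```python
from typing import NamedTuple


class Span(NamedTuple):
    first: int
    past: int


def is_empty(s: Span) -> bool:
    return s.first == s.past


def same_string(memory: bytes, a: Span, b: Span) -> bool:
    # Compare by contents, not by descriptor fields.
    return memory[a.first:a.past] == memory[b.first:b.past]


memory = b"abcd"
empties = [Span(n, n) for n in range(len(memory) + 1)]

all_empty = all(is_empty(s) for s in empties)
distinct_descriptors = len(set(empties))
fieldwise_equal = empties[0] == empties[1]
contentwise_equal = same_string(memory, empties[0], empties[1])
```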

>>> That said, a slice would probably have a length which the callee can
>>> determine but which the callee cannot change. I presume that's what
>>> you'd call a constraint.
>>
>> It could be a constraint for fixed length slices.
>>
>>> If a callee wanted to be able to change the length of a string then
>>> it would have to be passed a real string, not a slice.
>>
>> A callee might pass a variable length slice, which, for example, can
>> be enlarged or shortened. Many languages with dynamically allocated
>> strings have this.
>
> What is your definition of a slice? Is it /part/ of an underlying string
> or is it a /copy/ of part of a string? For example, if
>
> string S = "abcde"
> slice T = S[1..3] ;"bcd"
>
> then changes to T would do what to S?

No idea. It depends. Is slice in your example an independent object?

But considering this:

declare
   S : String := "abcde";
begin
   S (1..3) := "x"; -- Illegal in Ada

But should it be legal, then the result would be

"xde"

Many implementations make this illegal because it would require either
bounded or dynamically allocated unbounded string.

You can consider making it legal for these, but then you would have
different semantics of slices for different strings. And this would
contradict the design principle of having all strings interchangeable
regardless of the implementation method.

There are contradictions in the requirements that you, as the language
designer, have to resolve one way or another.

> If slice is a view of an underlying string (which is what I had in mind)
> then I don't get how you could meaningfully enlarge or shorten it.

It is only your limited understanding of a view as immutable and fixed
length. E.g. if you view a house in infrared why should you not be able
to open its door? Infrared goggles would not limit you. An infrared photo
of a house would! (:-))

>>> I guess there would be these kinds of string argument:
>>>
>>> 1. Read-write string. Anything could be done to the string by the
>>> callee. (Would have to be a real string.)
>>>
>>> 2. Read-write fixed-length string. The string's contents could be
>>> altered but it could not be made longer or shorter. (Could be a real
>>> string or a slice.)
>>>
>>> 3. Read-only string. Neither its length nor its contents could be
>>> altered by the callee. (Could be a real string or a slice.)
>>
>> Think of it in terms of constraints. Immutability is a constraint.
>> Fixed length is a constraint. Bounded length is a constraint.
>> Non-sliding lower bound is a constraint. Non-sliding upper bound is a
>> constraint.
>>
>> This should cover all spectrum. You can express all cases in terms of
>> constraints.
>
> I presume such constraints would be specified when objects are declared.

Objects and/or subtypes. Depending on the language preferences. Note
also that you can have constrained views of the same object. E.g. you
have a mutable variable passed down as in-argument. That would be an
immutable view of the same object.

> As a programmer how would you want to specify such constraints? Would
> each have a reserved word, for example?

In some cases constraints might be implied. But usually languages have
lots of [sub]type modifiers like

in, in out, out, constant
atomic, volatile, shared
aliased (can get pointers to)
external, static
public, private, protected (visibility constraints)
range, length, bounds
parameter AKA discriminant (general purpose constraint)
specific type AKA static/dynamic up/downcast (view as another type)
class-wide (view as a class of types rooted in this one)
...
measurement unit

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: Storing strings

<tjm328$1mqd$1@gioia.aioe.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1866&group=comp.lang.misc#1866

From: bc@freeuk.com (Bart)
Newsgroups: comp.lang.misc
Subject: Re: Storing strings
Date: Sun, 30 Oct 2022 14:51:53 +0000
Organization: Aioe.org NNTP Server
Message-ID: <tjm328$1mqd$1@gioia.aioe.org>
References: <tj5phf$1lggf$2@dont-email.me> <tj5v64$od0$1@gioia.aioe.org>
<tjc2kf$2hget$1@dont-email.me> <tjdbvk$1sps$1@gioia.aioe.org>
<tjj2eu$3eqiv$2@dont-email.me> <tjk1ok$71a$1@gioia.aioe.org>
<tjlmth$4kfm$1@dont-email.me> <tjm16h$vrp$1@gioia.aioe.org>

by: Bart - Sun, 30 Oct 2022 14:51 UTC

On 30/10/2022 14:20, Dmitry A. Kazakov wrote:
> On 2022-10-30 12:24, James Harris wrote:
>> On 29/10/2022 21:17, Dmitry A. Kazakov wrote:
>>> On 2022-10-29 13:23, James Harris wrote:
>>
>> ..
>>
>>>> As for this specific case, the same information can be conveyed in
>>>> different ways: (start, length), (start, memsize), (first, last),
>>>> (first, past). I chose the latter as it should be slightly faster
>>>> than the others and does not run into problems when the elements are
>>>> other than single bytes.
>>>
>>> Yes. The problem with (first, next) is that next could be
>>> inexpressible. Most difficulties arise with strings/arrays over
>>> enumerations and modular types. (first, last) has no such problem.
>>>
>>> Both have issues with empty strings, e.g. with a multitude of
>>> representations of. Compare with +/-0 problem for non-2-complement
>>> integers.
>>
>> That sounds interesting. Do you see multiple representations of the
>> empty string in the following? Monospacing required. Here's how the
>> string "abcd" would be stored
>>
>>    !_a_!_b_!_c_!_d_!
>>
>>      ^               ^
>>      !               !
>>    first            past
>>
>> * so first would point at the first element of the string
>> * and past would point one cell beyond the last element of the string.
>>
>> I don't see where you see a multitude of representations of the null
>> string. AISI the empty string would simply have past equal to first in
>> all cases.
>
>    ...
>    (0..0)
>    (1..1)
>    (2..2)
>    ...
>    (n..n)
>    ...
>
> With pointers it becomes even worse as some of them might point to
> invalid addresses.

I don't know what these numbers mean. The main problem with 'first' and
'past' is that with an empty string, 'first' doesn't point anywhere, and
'past' ends up pointing to that same place, wherever that is.

I don't like it because that address is meaningless. Except possibly
when referring to an empty slice of an actual string.

>> string S = "abcde"
>> slice T = S[1..3] ;"bcd"
>>
>> then changes to T would do what to S?

Let's try it:

s ::= "abcde" # ::= is needed to make s (and t) mutable
t := s[1..3]
t[2]:="?"

println s # a?cde
println t # a?c

The language doesn't allow an empty slice, say s[1..0], although it
ought to be well-behaved (I think it just expects j>=i in s[i..j].)
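
For comparison, roughly the same experiment in Python, where a memoryview over a bytearray behaves as a view-style slice and plain slicing makes a copy. Note Python is 0-based with half-open ranges, so [0:3] here corresponds to 1..3 above:

```python
s = bytearray(b"abcde")

t = memoryview(s)[0:3]    # a view of "abc", sharing storage with s
t[1] = ord("?")
s_after = bytes(s)
t_after = bytes(t)

c = s[0:3]                # plain slicing of a bytearray makes a copy
c[0] = ord("Z")
s_unaffected = bytes(s)

empty = memoryview(s)[1:1]  # an empty slice is allowed here
empty_len = len(empty)
```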

> No idea. It depends. Is slice in your example an independent object?
>
> But considering this:
>
> declare
> S : String := "abcde";
> begin
> S (1..3) := "x"; -- Illegal in Ada
>
> But should it be legal, then the result would be
>
> "xde"
>
> Many implementations make this illegal because it would require either
> bounded or dynamically allocated unbounded string.

The language gets to say how this works. In mine it would have to be
like this:

s ::= "abcde" # ::= creates a mutable copy
s[1..3] := "xyz" # Can only insert string of matching length

s ends up as "xyzde"
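
Python happens to show both behaviours side by side, which may help compare the two designs. In this sketch a bytearray accepts a slice assignment of any length (giving the "xde" result), while a memoryview, being fixed-length, rejects a mismatched one:

```python
s = bytearray(b"abcde")
s[0:3] = b"xyz"              # same length: overwritten in place
same_length = bytes(s)

u = bytearray(b"abcde")
u[0:3] = b"x"                # length-changing assignment: the string shrinks
shrunk = bytes(u)

v = bytearray(b"abcde")
view = memoryview(v)
try:
    view[0:3] = b"x"         # a fixed-length view refuses the mismatch
    refused = False
except ValueError:
    refused = True
view.release()
unchanged = bytes(v)
```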

Re: Storing strings

<8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1867&group=comp.lang.misc#1867

Newsgroups: comp.lang.misc
Date: Sun, 30 Oct 2022 09:21:45 -0700 (PDT)
In-Reply-To: <tj5phf$1lggf$2@dont-email.me>
References: <tj5phf$1lggf$2@dont-email.me>
Message-ID: <8c53e3a9-c339-4bba-a8d1-c05443ee5bcen@googlegroups.com>
Subject: Re: Storing strings
From: mijoryx@yahoo.com (luserdroog)

by: luserdroog - Sun, 30 Oct 2022 16:21 UTC

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
> Do you guys have any thoughts on the best ways for strings of characters
> to be stored?
>
> 1. There's the C way, of course, of reserving one value (zero) and using
> it as a terminator.
>
> 2. There's the 'length prefix' option of putting the length of the
> string in a machine word before the characters.
>
> 3. There's the 'double pointer' way of pointing at, say, first and past
> (where 'past' means first plus length such that the second pointer
> points one position beyond the last character).
>
> Any others?
>
> Options 1 and 2 have the advantage that they can be referred to simply
> by address. Option 3 needs an additional place in which to store the
> (first, past) control block.
>
> Option 1 has the advantage that it's easy for a program to process (by
> either pointer or index).
>
> Options 1 and 3 have the advantage that one can refer to the tail of the
> string (anything past the first character) without creating a copy,
> although option 3 would need a new control block to be created. Option 2
> would require a new string to be created.
>
> In fact, option 3 has the advantage that it allows any continuous
> substring - head, mid, or tail - to be referred to without making a copy
> of the required part of the string.
>
> Options 2 and 3 make it fast to find the length. They also allow any
> value (i.e. including zero) to be part of the string.
>
> So: Which of those should a compiler support? Should it support more
> than one form? If so, should the language allow the programmer to
> specify which form to use on any particular string?
>
> If that's not complicated enough, the above essentially considers
> strings whose contents could be read-only or read-write but their
> lengths don't change. If the lengths can change then there are
> additional issues of storage management. Eek! ;)
>
> Recommendations welcome!
>

I think an exhaustive list of options would be very large if you're not
pre-judging and filtering as you're adding options.

4) [List|Array|Tuple|Iterator] of character objects

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
can be used to format the data to squeeze it into 7 bits.

6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
whole byte for metadata attached to each character.
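
A rough Python sketch of option 5, assuming plain 7-bit ASCII data with the top bit set on the final character (the function names are hypothetical). One consequence is that the empty string has no encoding in this scheme:

```python
def encode_hibit(text: str) -> bytes:
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-7-bit input
    if not data:
        raise ValueError("this scheme cannot represent the empty string")
    # Mark the last character by setting its 8th bit.
    return data[:-1] + bytes([data[-1] | 0x80])


def decode_hibit(buf: bytes, start: int = 0) -> str:
    chars = []
    i = start
    while True:
        b = buf[i]
        chars.append(chr(b & 0x7F))
        if b & 0x80:  # terminator bit found on this character
            return "".join(chars)
        i += 1


encoded = encode_hibit("abc")
packed = encode_hibit("hi") + encode_hibit("yo")  # two strings back to back
first_word = decode_hibit(packed)
second_word = decode_hibit(packed, 2)
```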

....
