RetroBBS - comp.unix.programmer - Re: UTF-8 overlong encodings

UTF-8 overlong encodings

<874k76zvrc.fsf@doppelsaurus.mobileactivedefense.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16955&group=comp.unix.programmer#16955

Newsgroups: comp.unix.programmer

by: Rainer Weikusat - Fri, 17 Dec 2021 22:55 UTC

As usual with technical terms "everyone understands", it gets thrown
around everywhere but is never defined. The definition I derived is
below.

The non-ASCII part of UTF-8 is composed of 5 ranges each of which
starts with a number which has only one bit set. The starting numbers
are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
'overlong' when this start bit isn't set.

Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
and 26.

Each range is composed of a number of six bit blocks plus a
remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits of
the left-most six bit blocks are 5, 11, 17, 23, 29.

Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yields the
position of this starting bit relative to the highest bit in the first
six bit block. The corresponding values are 2, 0, -1, -2 and -3.

The first case is special because the starting bit is the bit
corresponding with 1 in the first byte. All other start bits are in the
second byte, at positions 5, 4, 3 and 2.

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2
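
As a rough cross-check of the numbers above, the arithmetic can be
written down in C. This is only a sketch: start_bit, min_value and
start_bit_in_second_byte are names made up for this illustration, and the
formula assumes the original scheme with sequences of 2 to 6 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: shift value of the "start bit" for an encoded
   length of 2..6 bytes, i.e. 7, 11, 16, 21, 26. */
static unsigned start_bit(unsigned len)
{
    assert(len >= 2 && len <= 6);
    /* A single byte holds 7 value bits; a sequence of n >= 2 bytes holds
       5n + 1. The start bit is the first bit that no longer fits into a
       sequence that is one byte shorter. */
    return len == 2 ? 7 : 5 * (len - 1) + 1;
}

/* Smallest value that needs len bytes:
   0x80, 0x800, 0x10000, 0x200000, 0x4000000. */
static uint32_t min_value(unsigned len)
{
    return (uint32_t)1 << start_bit(len);
}

/* Position of the start bit inside the second byte (lengths 3..6 only). */
static unsigned start_bit_in_second_byte(unsigned len)
{
    assert(len >= 3 && len <= 6);
    return 8 - len;
}
```

For lengths 3 to 6, start_bit(len) - 6 * (len - 2) gives the same result
as start_bit_in_second_byte(len), which is the "8 - length" rule from the
table.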

Any corrections or other comments very much welcome.

Re: UTF-8 overlong encodings

<87y24ird6y.fsf@bsb.me.uk>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16956&group=comp.unix.programmer#16956

Newsgroups: comp.unix.programmer

by: Ben Bacarisse - Sat, 18 Dec 2021 00:04 UTC

Rainer Weikusat <rweikusat@talktalk.net> writes:

> As usual with technical terms "everyone understands", it gets thrown
> around everywhere but is never defined. The definition I derived is
> below.
>
> The non-ASCII part of UTF-8 is composed of 5 ranges each of which
> starts with a number which has only one bit set. The starting numbers
> are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
> 'overlong' when this start bit isn't set.

I'd express it in terms of magnitude. An overlong 2-byte sequence will
decode to a value less than 0x80. An overlong encoded 3-byte value will be
less than 0x800 and so on. Or going the other way, you need at least
two bytes if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
so on.
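
The magnitude rule can be sketched in C as follows. The function names
are invented for this example, and it assumes the sequence has already
been decoded to a value v and that its length in bytes is known.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: fewest bytes needed to encode v in the original
   1..6 byte scheme (values up to 0x7FFFFFFF). */
static unsigned utf8_min_len(uint32_t v)
{
    assert(v <= 0x7FFFFFFFu);
    if (v < 0x80) return 1;
    if (v < 0x800) return 2;
    if (v < 0x10000) return 3;
    if (v < 0x200000) return 4;
    if (v < 0x4000000) return 5;
    return 6;
}

/* A sequence of used_len bytes that decoded to v is overlong when v
   would have fit into fewer bytes. */
static int is_overlong_value(uint32_t v, unsigned used_len)
{
    return used_len > utf8_min_len(v);
}
```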

When looking at the encoding itself, an overlong sequence is one that
starts with one of the bytes C0, C1, E0, F0, F8 or FC.
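
One caveat when turning this into code: C0 and C1 can only start
overlong sequences, but E0, F0, F8 and FC also start valid ones (E0 A0 80
is U+0800, for instance), so for those four the second byte seems to be
needed as well. A minimal sketch, with a made-up function name and
assuming the sequence is otherwise well-formed:

```c
#include <assert.h>

/* Hypothetical check on the first two bytes of a sequence; b1 is assumed
   to be a valid continuation byte (10xxxxxx). Returns 1 if overlong. */
static int overlong_by_prefix(unsigned char b0, unsigned char b1)
{
    assert((b1 & 0xC0) == 0x80);
    if (b0 == 0xC0 || b0 == 0xC1) return 1; /* 2 bytes, value < 0x80 */
    if (b0 == 0xE0) return b1 < 0xA0;       /* 3 bytes, value < 0x800 */
    if (b0 == 0xF0) return b1 < 0x90;       /* 4 bytes, value < 0x10000 */
    if (b0 == 0xF8) return b1 < 0x88;       /* 5 bytes, value < 0x200000 */
    if (b0 == 0xFC) return b1 < 0x84;       /* 6 bytes, value < 0x4000000 */
    return 0;                               /* lead byte has a value bit set */
}
```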

Unicode has said it won't use more than the 21 bits available in the
four-byte encoding, so all sequences of length 5 or 6 are "overlong",
although in an entirely different sense.

> Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
> and 26.
>
> Each range is composed of a number of six bit blocks plus a
> remainder which gets put into the byte starting the encoded
> sequence. Again expressed as (left) shift arguments, the highest bits of
> the left-most six bit blocks are 5, 11, 17, 23, 29.
>
> Subtracting the shift value corresponding with the highest bit in the
> first six bit block from the shift value of the starting bit yields the
> position of this starting bit relative to the highest bit in the first
> six bit block. The corresponding values are 2, 0, -1, -2 and -3.
>
> The first case is special because the starting bit is the bit
> corresponding with 1 in the first byte. All other start bits are in the
> second byte, at positions 5, 4, 3 and 2.
>
> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
> ignoring the initial special case, the shift value relative to the start
> of the first six bit block for each encoded sequence is 8 -
> its length:
>
> 3 -> 5
> 4 -> 4
> 5 -> 3
> 6 -> 2
>
> Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to
the initial definition.

--
Ben.

Re: UTF-8 overlong encodings

<87lf0i7npv.fsf@nosuchdomain.example.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16957&group=comp.unix.programmer#16957

Newsgroups: comp.unix.programmer

by: Keith Thompson - Sat, 18 Dec 2021 00:37 UTC

Rainer Weikusat <rweikusat@talktalk.net> writes:
> As usual with technical terms "everyone understands", it gets thrown
> around everywhere but is never defined. The definition I derived is
> below.
>
> The non-ASCII part of UTF-8 is composed of 5 ranges each of which
> starts with a number which has only one bit set. The starting numbers
> are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
> 'overlong' when this start bit isn't set.
>
> Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
> and 26.
>
> Each range is composed of a number of six bit blocks plus a
> remainder which gets put into the byte starting the encoded
> sequence. Again expressed as (left) shift arguments, the highest bits of
> the left-most six bit blocks are 5, 11, 17, 23, 29.
>
> Subtracting the shift value corresponding with the highest bit in the
> first six bit block from the shift value of the starting bit yields the
> position of this starting bit relative to the highest bit in the first
> six bit block. The corresponding values are 2, 0, -1, -2 and -3.
>
> The first case is special because the starting bit is the bit
> corresponding with 1 in the first byte. All other start bits are in the
> second byte, at positions 5, 4, 3 and 2.
>
> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
> ignoring the initial special case, the shift value relative to the start
> of the first six bit block for each encoded sequence is 8 -
> its length:
>
> 3 -> 5
> 4 -> 4
> 5 -> 3
> 6 -> 2
>
> Any corrections or other comments very much welcome.

Unicode only defines character values up to 0x10fffd, so there are no
valid encodings longer than 4 octets.

Here's a table I came up with a while ago:

00-7F (7 bits) 0xxxxxxx
0080-07FF (11 bits) 110xxxxx 10xxxxxx
0800-FFFF (16 bits) 1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF (21 bits) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The character code is determined by concatenating the 'x's together.
A 1-octet encoding has 7 value bits.
A 2-octet encoding has 11 value bits.
And so on.

An octet starting with 0 is always a single-octet character (ASCII compatible).
An octet starting with 11 is always the first octet of a multi-octet encoding.
An octet starting with 10 is always a continuation octet.

Overlong encodings that use more octets than necessary are invalid. For
example, the letter 'k' is 0x6b or 1101011 and is encoded in a single
octet:
    01101011
     -------
A two-octet encoding of the same character code is invalid:
    11000001 10101011
       -----   ------
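
To see the table and the 'k' example in action, here is a small C sketch.
Both function names are invented for this illustration, and the decoder
deliberately performs no validity checks, which is exactly what lets the
overlong form through.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sequence length announced by a lead octet, per the
   four-row table above; 0 for a continuation octet or an octet that
   cannot start a sequence of at most four octets. */
static unsigned utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: single octet (ASCII) */
    if ((b & 0xC0) == 0x80) return 0;  /* 10xxxxxx: continuation octet */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                          /* 11111xxx: not in the table */
}

/* Concatenate the 'x' bits of a sequence of len octets, unchecked. */
static uint32_t utf8_decode_naive(const unsigned char *p, unsigned len)
{
    assert(len >= 1 && len <= 4);
    if (len == 1) return p[0];
    uint32_t v = p[0] & (0xFFu >> (len + 1)); /* value bits of the lead octet */
    for (unsigned i = 1; i < len; i++)
        v = (v << 6) | (p[i] & 0x3Fu);        /* six bits per continuation octet */
    return v;
}
```

Both 6B and the invalid two-octet form C1 AB come out as 0x6b here, so a
decoder that accepts overlong forms cannot tell them apart afterwards.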

You could extrapolate UTF-8 to up to 8-octet encodings, representing up to
42 bits, but that's also not valid UTF-8 (though I can imagine it being
useful for some purposes).

200000-3FFFFFF (26 bits) 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
4000000-7FFFFFFF (31 bits) 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
80000000-FFFFFFFFF (36 bits) 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1000000000-3FFFFFFFFFF (42 bits) 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

https://en.wikipedia.org/wiki/UTF-8

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Re: UTF-8 overlong encodings

<87mtkx6di7.fsf@doppelsaurus.mobileactivedefense.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16959&group=comp.unix.programmer#16959

Newsgroups: comp.unix.programmer

by: Rainer Weikusat - Sat, 18 Dec 2021 17:15 UTC

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Rainer Weikusat <rweikusat@talktalk.net> writes:
>
>> As usual with technical terms "everyone understands", it gets thrown
>> around everywhere but is never defined. The definition I derived is
>> below.
>>
>> The non-ASCII part of UTF-8 is composed of 5 ranges each of which
>> starts with a number which has only one bit set. The starting numbers
>> are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
>> 'overlong' when this start bit isn't set.
>
> I'd express it in terms of magnitude. An overlong 2-byte sequence will
> decode to a value less than 0x80. An overlong encoded 3-byte value will be
> less than 0x800 and so on. Or going the other way, you need at least
> two bytes if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
> so on.

Yes. That's an error I made: An overlong sequence is one where none of
the bits between the end of the prefix and the start bit (inclusive) are
set.

[...]

>> Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
>> and 26.
>>
>> Each range is composed of a number of six bit blocks plus a
>> remainder which gets put into the byte starting the encoded
>> sequence. Again expressed as (left) shift arguments, the highest bits of
>> the left-most six bit blocks are 5, 11, 17, 23, 29.
>>
>> Subtracting the shift value corresponding with the highest bit in the
>> first six bit block from the shift value of the starting bit yields the
>> position of this starting bit relative to the highest bit in the first
>> six bit block. The corresponding values are 2, 0, -1, -2 and -3.
>>
>> The first case is special because the starting bit is the bit
>> corresponding with 1 in the first byte. All other start bits are in the
>> second byte, at positions 5, 4, 3 and 2.
>>
>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>> ignoring the initial special case, the shift value relative to the start
>> of the first six bit block for each encoded sequence is 8 -
>> its length:
>>
>> 3 -> 5
>> 4 -> 4
>> 5 -> 3
>> 6 -> 2
>>
>> Any corrections or other comments very much welcome.
>
> I was not sure what this part of the description was supposed to add to
> the initial definition.

I want to calculate that with a general algorithm.

Re: UTF-8 overlong encodings

<87ilvl6dbe.fsf@doppelsaurus.mobileactivedefense.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16960&group=comp.unix.programmer#16960

Newsgroups: comp.unix.programmer

by: Rainer Weikusat - Sat, 18 Dec 2021 17:19 UTC

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
> Rainer Weikusat <rweikusat@talktalk.net> writes:
>> As usual with technical terms "everyone understands", it gets thrown
>> around everywhere but is never defined. The definition I derived is
>> below.

[...]

> Unicode only defines character values up to 0x10fffd, so there are no
> valid encodings longer than 4 octets.
>
> Here's a table I came up with a while ago:
>
> 00-7F (7 bits) 0xxxxxxx
> 0080-07FF (11 bits) 110xxxxx 10xxxxxx
> 0800-FFFF (16 bits) 1110xxxx 10xxxxxx 10xxxxxx
> 010000-10FFFF (21 bits) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The Linux UTF-8 man page also has 5 and 6 byte sequences.

Re: UTF-8 overlong encodings

<87sfuppku0.fsf@bsb.me.uk>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16962&group=comp.unix.programmer#16962

Newsgroups: comp.unix.programmer

by: Ben Bacarisse - Sat, 18 Dec 2021 23:14 UTC

Rainer Weikusat <rweikusat@talktalk.net> writes:

> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>> Rainer Weikusat <rweikusat@talktalk.net> writes:
>>
>>> As usual with technical terms "everyone understands", it gets thrown
>>> around everywhere but is never defined. The definition I derived is
>>> below.
>>>
>>> The non-ASCII part of UTF-8 is composed of 5 ranges each of which
>>> starts with a number which has only one bit set. The starting numbers
>>> are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
>>> 'overlong' when this start bit isn't set.
>>
>> I'd express it in terms of magnitude. An overlong 2-byte sequence will
>> decode to a value less than 0x80. An overlong encoded 3-byte value will be
>> less than 0x800 and so on. Or going the other way, you need at least
>> two bytes if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
>> so on.
>
> Yes. That's an error I made: An overlong sequence is one where none of
> the bits between the end of the prefix and the start bit (inclusive) are
> set.
>
> [...]
>
>>> Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
>>> and 26.
>>>
>>> Each range is composed of a number of six bit blocks plus a
>>> remainder which gets put into the byte starting the encoded
>>> sequence. Again expressed as (left) shift arguments, the highest bits of
>>> the left-most six bit blocks are 5, 11, 17, 23, 29.
>>>
>>> Subtracting the shift value corresponding with the highest bit in the
>>> first six bit block from the shift value of the starting bit yields the
>>> position of this starting bit relative to the highest bit in the first
>>> six bit block. The corresponding values are 2, 0, -1, -2 and -3.
>>>
>>> The first case is special because the starting bit is the bit
>>> corresponding with 1 in the first byte. All other start bits are in the
>>> second byte, at positions 5, 4, 3 and 2.
>>>
>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>>> ignoring the initial special case, the shift value relative to the start
>>> of the first six bit block for each encoded sequence is 8 -
>>> its length:
>>>
>>> 3 -> 5
>>> 4 -> 4
>>> 5 -> 3
>>> 6 -> 2
>>>
>>> Any corrections or other comments very much welcome.
>>
>> I was not sure what this part of the description was supposed to add to
>> the initial definition.
>
> I want to calculate that with a general algorithm.

I don't know what "that" refers to. Do you want to calculate the UTF-8
sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what is its result?

--
Ben.

Re: UTF-8 overlong encodings

<splvjm$msm$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16963&group=comp.unix.programmer#16963

Newsgroups: comp.unix.programmer

by: John McCue - Sun, 19 Dec 2021 00:49 UTC

Rainer Weikusat <rweikusat@talktalk.net> wrote:
> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
>> Rainer Weikusat <rweikusat@talktalk.net> writes:
<snip>

>> Unicode only defines character values up to 0x10fffd, so there are no
>> valid encodings longer than 4 octets.

This is my understanding also.

<snip>
>
> The Linux UTF-8 man page also has 5 and 6 byte sequences.

I saw somewhere that 5 and 6 byte sequences were originally
defined, or thought to be needed, but UTF-8 is now limited
to 4 bytes.

--
csh(1) - "An elegant shell, for a more... civilized age."
- Paraphrasing Star Wars

Re: UTF-8 overlong encodings

<87tuf446mh.fsf@doppelsaurus.mobileactivedefense.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16965&group=comp.unix.programmer#16965

Newsgroups: comp.unix.programmer

by: Rainer Weikusat - Sun, 19 Dec 2021 21:39 UTC

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Rainer Weikusat <rweikusat@talktalk.net> writes:

[...]

>>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>>>> ignoring the initial special case, the shift value relative to the start
>>>> of the first six bit block for each encoded sequence is 8 -
>>>> its length:
>>>>
>>>> 3 -> 5
>>>> 4 -> 4
>>>> 5 -> 3
>>>> 6 -> 2
>>>>
>>>> Any corrections or other comments very much welcome.
>>>
>>> I was not sure what this part of the description was supposed to add to
>>> the initial definition.
>>
>> I want to calculate that with a general algorithm.
>
> I don't know what "that" refers to. Do you want to calculate the UTF-8
> sequence length from the code point? It seems not. Do you want to
> determine if a sequence is overlong by looking at the sequence? It
> seems not. What is the algorithm given, and what is its result?

I want to determine if a sequence is overlong using a generalized
algorithm for that, ie, not by special-casing start byte values. So far,
the untested (and very likely buggy) code for this looks as follows:

u_len is the length of the sequence in bytes, p a pointer to the first
byte. Some unrelated consistency checks removed.

    mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */
    x = *p & mask;
    if (u_len == 2) if (x < 2) return U_BIN; /* 2 byte sequence overlong if only the lowest bit set */

    y = *++p;

    if (!x) { /* x == 0 implies u_len > 2 */
        mask = ~((1 << (8 - u_len)) - 1); /* all bits down to start bit in 2nd byte set */
        if ((y & mask) == 0x80) return U_BIN; /* overlong if continuation pattern only */
    }
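
For anyone who wants to experiment with the fragment, here is one way to
wrap it into a self-contained function. The function name, the signature
and the return convention (1 for overlong, instead of returning U_BIN)
are assumptions made for this sketch. It also assumes that u_len was
derived from the lead byte and that the continuation bytes have already
been validated.

```c
#include <assert.h>

/* Sketch of the check above. Returns 1 if the sequence of u_len bytes
   starting at p is overlong, 0 otherwise. Assumes 2 <= u_len <= 6 and an
   otherwise well-formed sequence. */
static int is_overlong_seq(const unsigned char *p, unsigned u_len)
{
    unsigned mask, x, y;

    assert(u_len >= 2 && u_len <= 6);

    /* Value bits of the first byte, plus the 0 bit that ends the prefix. */
    mask = (1u << (8 - u_len)) - 1;
    x = p[0] & mask;
    if (u_len == 2)
        return x < 2;       /* C0 and C1: value below 0x80 */

    if (x)
        return 0;           /* a value bit in the first byte is set */

    /* All bits of the second byte down to the start bit (position 8 - u_len). */
    y = p[1];
    mask = ~((1u << (8 - u_len)) - 1) & 0xFFu;
    return (y & mask) == 0x80; /* only the continuation pattern is left */
}
```

A few hand-checked boundary cases, such as E0 9F BF (overlong) against
E0 A0 80 (U+0800, not overlong), behave as expected with this version.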

Re: UTF-8 overlong encodings

<87pmps45yi.fsf@doppelsaurus.mobileactivedefense.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16966&group=comp.unix.programmer#16966

Newsgroups: comp.unix.programmer

by: Rainer Weikusat - Sun, 19 Dec 2021 21:53 UTC

Rainer Weikusat <rweikusat@talktalk.net> writes:
> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>> Rainer Weikusat <rweikusat@talktalk.net> writes:
>
> [...]
>
>>>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>>>>> ignoring the initial special case, the shift value relative to the start
>>>>> of the first six bit block for each encoded sequence is 8 -
>>>>> its length:
>>>>>
>>>>> 3 -> 5
>>>>> 4 -> 4
>>>>> 5 -> 3
>>>>> 6 -> 2
>>>>>
>>>>> Any corrections or other comments very much welcome.
>>>>
>>>> I was not sure what this part of the description was supposed to add to
>>>> the initial definition.
>>>
>>> I want to calculate that with a general algorithm.
>>
>> I don't know what "that" refers to. Do you want to calculate the UTF-8
>> sequence length from the code point? It seems not. Do you want to
>> determine if a sequence is overlong by looking at the sequence? It
>> seems not. What is the algorithm given, and what is its result?
>
> I want to determine if a sequence is overlong using a generalized
> algorithm for that, ie, not by special-casing start byte values. So far,
> the untested (and very likely buggy) code for this looks as follows:
>
> u_len is the length of the sequence in bytes, p a pointer to the first
> byte. Some unrelated consistency checks removed.

At the moment, I'm convinced that this algorithm is complete
nonsense. :-)

Re: UTF-8 overlong encodings

<87ee68lyhn.fsf@bsb.me.uk>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16967&group=comp.unix.programmer#16967

Newsgroups: comp.unix.programmer

by: Ben Bacarisse - Mon, 20 Dec 2021 03:57 UTC

Rainer Weikusat <rweikusat@talktalk.net> writes:

> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>> Rainer Weikusat <rweikusat@talktalk.net> writes:
>
> [...]
>
>>>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>>>>> ignoring the initial special case, the shift value relative to the start
>>>>> of the first six bit block for each encoded sequence is 8 -
>>>>> its length:
>>>>>
>>>>> 3 -> 5
>>>>> 4 -> 4
>>>>> 5 -> 3
>>>>> 6 -> 2
>>>>>
>>>>> Any corrections or other comments very much welcome.
>>>>
>>>> I was not sure what this part of the description was supposed to add to
>>>> the initial definition.
>>>
>>> I want to calculate that with a general algorithm.
>>
>> I don't know what "that" refers to. Do you want to calculate the UTF-8
>> sequence length from the code point? It seems not. Do you want to
>> determine if a sequence is overlong by looking at the sequence? It
>> seems not. What is the algorithm given, and what is its result?
>
> I want to determine if a sequence is overlong using a generalized
> algorithm for that, ie, not by special-casing start byte values.

I don't think I follow what you mean. Overlong sequences are a special
case, so you have to special-case something. Why not the first byte? It
seems to be such a simple method.

> So far,
> the untested (and very likely buggy) code for this looks as follows:
>
> u_len is the length of the sequence in bytes,

How have you calculated u_len? You can detect an overlong sequence
without knowing it, so there is some risk in using it when it's not
needed.

> p a pointer to the first
> byte. Some unrelated consistency checks removed.
>
> mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */

That includes one more bit than you want. In a proper UTF-8 sequence,
that bit will be zero, so it's harmless, but have you already checked
that the sequence is valid (other than possibly being overlong)?

By the way, I'd use 0xff >> u_len to get the mask. It seems more
natural.
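
The two spellings of the mask, and a variant that covers only the value
bits of the lead byte, can be compared with a short C sketch. The
function names are made up for this illustration; u_len is the sequence
length in bytes, 2 to 6.

```c
#include <assert.h>

/* Mask as posted: for u_len == 2 this is 0x3F, one bit wider than the
   five value bits of 110xxxxx. */
static unsigned mask_posted(unsigned u_len)
{
    assert(u_len >= 2 && u_len <= 6);
    return (1u << (8 - u_len)) - 1;
}

/* The same mask written as a right shift of 0xff. */
static unsigned mask_shifted(unsigned u_len)
{
    assert(u_len >= 2 && u_len <= 6);
    return 0xFFu >> u_len;
}

/* Mask covering only the value bits of the lead byte. */
static unsigned mask_value_bits(unsigned u_len)
{
    assert(u_len >= 2 && u_len <= 6);
    return 0xFFu >> (u_len + 1);
}
```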

> x = *p & mask;
> if (u_len == 2) if (x < 2) return U_BIN; /* 2 byte sequence overlong if only the lowest bit set */

(or if no bits are set, but you include that in your test)

> y = *++p;

I don't see why you need to look at the next byte.

> if (!x) { /* x == 0 implies u_len > 2 */

x == 0 implies an overlong sequence now that you have dealt with the
length 2 case, which can have one bit of x set and still be overlong.

> mask = ~((1 << (8 - u_len)) - 1); /* all bits down to start bit in 2nd byte set */
> if ((y & mask) == 0x80) return U_BIN; /* overlong if continuation pattern only */
> }

--
Ben.

Re: UTF-8 overlong encodings

<61c04fee$0$8884$426a74cc@news.free.fr>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16968&group=comp.unix.programmer#16968

Newsgroups: comp.unix.programmer

by: Nicolas George - Mon, 20 Dec 2021 09:42 UTC

John McCue, in message <splvjm$msm$1@dont-email.me>, wrote:
> I saw somewhere that 5 and 6 byte sequences were originally
> defined, or thought to be needed, but UTF-8 is now limited
> to 4 bytes.

Unicode was limited to 20-21 bits because Microsoft and Sun decided to use
UTF-16 to go beyond 16 bits instead of making their ABI evolve with regard
to sizeof(wchar_t) or equivalent.

Re: UTF-8 overlong encodings

<61c08650$0$1355$426a74cc@news.free.fr>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16969&group=comp.unix.programmer#16969

Newsgroups: comp.unix.programmer

by: Nicolas George - Mon, 20 Dec 2021 13:34 UTC

Rainer Weikusat, in message
<87tuf446mh.fsf@doppelsaurus.mobileactivedefense.com>, wrote:
> I want to determine if a sequence is overlong using a generalized
> algorithm for that

Just decode and re-encode and see if the length is the same.
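
A minimal sketch of that suggestion, assuming the classic 1-6 byte
scheme discussed in this thread and assuming the lead byte and the
10xxxxxx continuation prefixes were validated beforehand. The function
names are made up for illustration. Only the length of the re-encoding
is needed, so nothing is actually re-encoded.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes the shortest encoding of cp needs (1-6 byte scheme). */
static size_t utf8_min_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp < 0x200000)  return 4;
    if (cp < 0x4000000) return 5;
    return 6;
}

/*
 * Decode one sequence of len bytes (1..6) starting at p.
 * Assumption: the byte prefixes have already been checked.
 */
static uint32_t utf8_decode(const unsigned char *p, size_t len)
{
    /* value bits of the lead byte: 7 for len 1, else 7 - len */
    uint32_t cp = (len == 1) ? p[0] : (p[0] & (0x7fu >> len));
    size_t i;

    for (i = 1; i < len; i++)
        cp = (cp << 6) | (p[i] & 0x3f);   /* six bits per continuation byte */
    return cp;
}

/* Overlong if the decoded value would fit into a shorter encoding. */
static int utf8_is_overlong(const unsigned char *p, size_t len)
{
    return utf8_min_len(utf8_decode(p, len)) != len;
}
```

With these definitions, c0 80 and e0 9f bf are reported as overlong,
while c2 80 and e0 a0 80 are not.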

Re: UTF-8 overlong encodings

<87a6gvi10g.fsf@doppelsaurus.mobileactivedefense.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16971&group=comp.unix.programmer#16971

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: rweikusat@talktalk.net (Rainer Weikusat)
Newsgroups: comp.unix.programmer
Subject: Re: UTF-8 overlong encodings
Date: Mon, 20 Dec 2021 18:28:47 +0000
Lines: 75
Message-ID: <87a6gvi10g.fsf@doppelsaurus.mobileactivedefense.com>
References: <874k76zvrc.fsf@doppelsaurus.mobileactivedefense.com>
<87y24ird6y.fsf@bsb.me.uk>
<87mtkx6di7.fsf@doppelsaurus.mobileactivedefense.com>
<87sfuppku0.fsf@bsb.me.uk>
<87tuf446mh.fsf@doppelsaurus.mobileactivedefense.com>
<87ee68lyhn.fsf@bsb.me.uk>
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: individual.net U50dpnv5wQ1zNbYs6wtrTgtcSscbPUkN67RN+1Yt2gcIuaZFE=
Cancel-Lock: sha1:8fWuZyt84snjYGOYhmCbmVdawsU= sha1:GKMqxKCwcBtDWN9w5zpOujII+8o=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)

by: Rainer Weikusat - Mon, 20 Dec 2021 18:28 UTC

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Rainer Weikusat <rweikusat@talktalk.net> writes:
>> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

[...]

>> u_len is the length of the sequence in bytes,
>
> How have you calculated u_len? You can detect an overlong sequence
> without knowing it, so there is some risk in using it when it's not
> needed.

Assuming x is the start byte of a UTF-8 sequence, stored as an unsigned
32-bit integer, the length of the sequence is (using a gcc extension)

__builtin_clz(x ^ 0xff) - 24

All bits left of 0x80 will already be clear. x ^ 0xff clears the leading
1 bits of the prefix and sets the 0 bit terminating the prefix to 1, so
the number of leading zeroes beyond the first 24 is the prefix length.
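
A small sketch of this calculation, assuming GCC or Clang (for
__builtin_clz) and a 32-bit unsigned int. The function name and the
return convention are made up for illustration. The guard is needed
because x == 0xff makes the argument of __builtin_clz zero, for which
the result is undefined, and 0xfe would yield 7.

```c
#include <stdint.h>

/*
 * Classify a byte x (0..0xff) by its run of leading 1 bits:
 *    0    ASCII byte (0xxxxxxx)
 *    1    continuation byte (10xxxxxx)
 *    2..6 lead byte of a sequence of that many bytes
 *   -1    0xfe or 0xff, which never appear in UTF-8
 */
static int utf8_prefix_len(uint32_t x)
{
    if (x >= 0xfe)
        return -1;                      /* avoids __builtin_clz(0) */
    /* x ^ 0xff is below 0x100, so at least 24 leading zero bits exist */
    return __builtin_clz(x ^ 0xff) - 24;
}
```

Note that an ASCII byte gives 0 rather than 1 here, so a caller that
wants a byte count has to handle that case separately.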

>
>> p a pointer to the first
>> byte. Some unrelated consistency checks removed.
>>
>> mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */
>
> That includes one more bit than you want.

Indeed. 8 - u_len is the shift index of the last non-zero prefix bit. It
should have been 7 - u_len or (8 - (u_len + 1)).
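
For illustration, the corrected mask as a function (the name is
hypothetical), valid for u_len from 2 to 6:

```c
/*
 * Mask selecting the value bits of the lead byte of a u_len-byte
 * sequence: a lead byte has u_len 1 bits and one 0 bit as its prefix,
 * which leaves 8 - (u_len + 1) = 7 - u_len value bits.
 */
static unsigned lead_value_mask(unsigned u_len)
{
    return (1u << (7 - u_len)) - 1;
}
```

This gives 0x1f, 0x0f, 0x07, 0x03 and 0x01 for lengths 2 to 6. With
8 - u_len, the result for length 2 would be 0x3f, which also covers the
0 bit terminating the prefix.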

[...]

> I don't see why you need to look at the next byte.
>
>> if (!x) { /* x == 0 implies u_len > 2 */
>
> x == 0 implies an overlong sequence now that you have dealt with the
> length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is
encoded as three bytes:

1110xxxx 10xxxxxx 10xxxxxx

Program encoding 0x800 in this way:

--------
#include <stdio.h>

int main(void)
{
    unsigned u = 0x800;
    unsigned char utf[3], *p;

    p = utf;
    *p++ = 0xe0 | (u >> 12);          /* 1110xxxx: top four bits */
    *p++ = 0x80 | ((u >> 6) & 63);    /* 10xxxxxx: middle six bits */
    *p = 0x80 | (u & 63);             /* 10xxxxxx: low six bits */

    printf("%x %x %x\n", *utf, utf[1], utf[2]);

    return 0;
}
-------

And the output is e0 a0 80. The situation is similar for all other
ranges, including the 4-byte sequence which is actually supposed to be
used: The first number of the range corresponds with a set bit in the second
byte.
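
The same experiment can be repeated for the other ranges with a
generalized encoder. This is only a sketch: the function name is made
up, and it deliberately does not check that len is the shortest
possible length for cp.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode cp as exactly len bytes (2..6) into out.
 * Assumption: cp fits into the 5 * len + 1 value bits available.
 */
static void utf8_encode_len(uint32_t cp, size_t len, unsigned char *out)
{
    size_t i;

    for (i = len - 1; i > 0; i--) {
        out[i] = 0x80 | (cp & 0x3f);    /* low six bits, 10xxxxxx */
        cp >>= 6;
    }
    /* len 1 bits, one 0 bit, then whatever value bits remain */
    out[0] = (unsigned char)((0xff00u >> len) | cp);
}
```

Encoding the first number of each range with its natural length gives
c2 80, e0 a0 80, f0 90 80 80, f8 88 80 80 80 and fc 84 80 80 80 80.
Except for the two-byte case, the lead byte carries no value bits and
exactly one value bit is set in the second byte.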

I may again have gotten something wrong here. But I've done all the
calculations twice and got the same result.

Re: UTF-8 overlong encodings

<87tuf35cmk.fsf@nosuchdomain.example.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16972&group=comp.unix.programmer#16972

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Keith.S.Thompson+u@gmail.com (Keith Thompson)
Newsgroups: comp.unix.programmer
Subject: Re: UTF-8 overlong encodings
Date: Mon, 20 Dec 2021 10:56:19 -0800
Organization: None to speak of
Lines: 19
Message-ID: <87tuf35cmk.fsf@nosuchdomain.example.com>
References: <874k76zvrc.fsf@doppelsaurus.mobileactivedefense.com>
<87lf0i7npv.fsf@nosuchdomain.example.com>
<87ilvl6dbe.fsf@doppelsaurus.mobileactivedefense.com>
<splvjm$msm$1@dont-email.me> <61c04fee$0$8884$426a74cc@news.free.fr>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="c0b9eb1ffd5be6f83e3e816f89e74b3e";
logging-data="25471"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18oNXyl/K7wL2K2UPkJQYwK"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:TukrbVXoCb7R/HWzTCkWLQ/696o=
sha1:cw/kOBcZf5e7GOkzx+m2aRpleFA=

by: Keith Thompson - Mon, 20 Dec 2021 18:56 UTC

Nicolas George <nicolas$george@salle-s.org> writes:
> John McCue, in message <splvjm$msm$1@dont-email.me>, wrote:
>> I saw somewhere 5 and 6 byte sequences were originally
>> defined or thought it would be needed, but now limited
>> to 4 bytes.
>
> Unicode was limited to 20-21 bits because Microsoft and Sun decided to use
> UTF-16 to go beyond 16 bits instead of making their ABI evolve with regard
> to sizeof(wchar_t) or equivalent.

I don't see how that follows. UTF-16, at least in its current form, can
represent all 1,112,064 valid Unicode code points.
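
The count and the reach of UTF-16 can be checked with a short sketch.
The function names are made up for illustration. The figure is the
0x110000 values from 0 to 0x10FFFF minus the 2048 surrogates
0xD800 - 0xDFFF, and a surrogate pair carries 10 + 10 bits on top of
an offset of 0x10000.

```c
#include <stdint.h>

/* 0x110000 values in total, minus the 0x800 surrogates. */
static uint32_t unicode_scalar_count(void)
{
    return 0x110000u - 0x800u;
}

/* Encode cp (0x10000..0x10FFFF) as a UTF-16 surrogate pair. */
static void utf16_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    cp -= 0x10000;                             /* leaves a 20-bit value */
    *hi = (uint16_t)(0xD800 | (cp >> 10));     /* upper ten bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3ff));   /* lower ten bits */
}
```

The largest value, 0x10FFFF, comes out as DBFF DFFF, the last
possible surrogate pair.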

Sun used UTF-16 for Java. UTF-16, as far as I know, was rare on Solaris.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Re: UTF-8 overlong encodings

<875yrjhy5i.fsf@doppelsaurus.mobileactivedefense.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16973&group=comp.unix.programmer#16973

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: rweikusat@talktalk.net (Rainer Weikusat)
Newsgroups: comp.unix.programmer
Subject: Re: UTF-8 overlong encodings
Date: Mon, 20 Dec 2021 19:30:33 +0000
Lines: 27
Message-ID: <875yrjhy5i.fsf@doppelsaurus.mobileactivedefense.com>
References: <874k76zvrc.fsf@doppelsaurus.mobileactivedefense.com>
<87y24ird6y.fsf@bsb.me.uk>
<87mtkx6di7.fsf@doppelsaurus.mobileactivedefense.com>
<87sfuppku0.fsf@bsb.me.uk>
<87tuf446mh.fsf@doppelsaurus.mobileactivedefense.com>
<87ee68lyhn.fsf@bsb.me.uk>
<87a6gvi10g.fsf@doppelsaurus.mobileactivedefense.com>
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: individual.net Gk6SQidUzNy5wMIHF8swWwLmgTWwUIjMMABqF1yv1pNi9MpiY=
Cancel-Lock: sha1:scSpdpYJ+FgBwTFycIZpeWgmsjU= sha1:ALHEsb5wu7jTi0Aon0B3kyLIjB8=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)

by: Rainer Weikusat - Mon, 20 Dec 2021 19:30 UTC

Rainer Weikusat <rweikusat@talktalk.net> writes:
> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

[...]

>> I don't see why you need to look at the next byte.
>>
>>> if (!x) { /* x == 0 implies u_len > 2 */
>>
>> x == 0 implies an overlong sequence now that you have dealt with the
>> length 2 case which can have one bit on x set and still be overlong.
>
> According to the Linux man page, a number in the range 0x800 - 0xffff is
> encoded as three bytes:
>
> 1110xxxx 10xxxxxx 10xxxxxx
>
> Program encoding 0x800 in this way:

[...]

> And the output is e0 a0 80.

Addition: Not technically a proof of correctness, but the ActionCable
(rot in hell) UTF-8 checker I have to placate accepts 0xe0 0xa0 0x80 as
a valid UTF-8 sequence.

Re: UTF-8 overlong encodings

<8735mnm2ci.fsf@bsb.me.uk>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16974&group=comp.unix.programmer#16974

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ben.usenet@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.unix.programmer
Subject: Re: UTF-8 overlong encodings
Date: Mon, 20 Dec 2021 20:46:21 +0000
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <8735mnm2ci.fsf@bsb.me.uk>
References: <874k76zvrc.fsf@doppelsaurus.mobileactivedefense.com>
<87y24ird6y.fsf@bsb.me.uk>
<87mtkx6di7.fsf@doppelsaurus.mobileactivedefense.com>
<87sfuppku0.fsf@bsb.me.uk>
<87tuf446mh.fsf@doppelsaurus.mobileactivedefense.com>
<87ee68lyhn.fsf@bsb.me.uk>
<87a6gvi10g.fsf@doppelsaurus.mobileactivedefense.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="e17f8d50552ff36ee4a9bb14f45ada15";
logging-data="15500"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19J29cpXlGpx2ewV4ONr+yxwAm+nVOtF+A="
Cancel-Lock: sha1:2C4GMEXwHiyahz0f0Dr2htrsJXo=
sha1:EACe5nmzt8oEqqbP9UDRIuKeWY8=
X-BSB-Auth: 1.7e2fb531e586a8dd1e22.20211220204621GMT.8735mnm2ci.fsf@bsb.me.uk

by: Ben Bacarisse - Mon, 20 Dec 2021 20:46 UTC

Rainer Weikusat <rweikusat@talktalk.net> writes:

> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
<cut>
>> I don't see why you need to look at the next byte.
>>
>>> if (!x) { /* x == 0 implies u_len > 2 */
>>
>> x == 0 implies an overlong sequence now that you have dealt with the
>> length 2 case which can have one bit on x set and still be overlong.
>
> According to the Linux man page, a number in the range 0x800 - 0xffff is
> encoded as three bytes:

Yup. I was not thinking. You need to look at the first two bytes when
the sequence length is > 2.

If b1 is 0xE0 then b2 must be >= 0xA0.
If b1 is 0xF0 then b2 must be >= 0x90.
If b1 is 0xF8 then b2 must be >= 0x88.
If b1 is 0xFC then b2 must be >= 0x84.

In this diagram, the ^ marks the lowest bit position such that it, or
some value bit to its left, must be set for the encoding not to be
overlong:

110x xxxx 10xx xxxx
       ^
1110 xxxx 10xx xxxx 10xx xxxx
            ^
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
             ^
1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
               ^
1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
                ^

In terms of bit masks,

(b2 & 0x3f) >> 8 - ulen

must be non zero for ulen > 2.
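
A sketch comparing the two formulations, with made-up function names.
It assumes, as in the table above, that the test on b2 only matters
when b1 carries no value bits, and that ulen is between 3 and 6. In C,
- binds tighter than >>, so the expression above is read as
(b2 & 0x3f) >> (8 - ulen); the parentheses are written out here.

```c
/* Smallest non-overlong second byte after a lead byte with no value
   bits set, indexed by sequence length; entries 0..2 are unused. */
static const unsigned char min_b2[7] = { 0, 0, 0, 0xA0, 0x90, 0x88, 0x84 };

/* Lead byte of a ulen-byte sequence with all value bits zero. */
static unsigned bare_lead(unsigned ulen)
{
    return (0xff00u >> ulen) & 0xff;    /* 0xE0, 0xF0, 0xF8, 0xFC */
}

/* Table form: b1 is the bare prefix and b2 is below the threshold. */
static int overlong_by_table(unsigned b1, unsigned b2, unsigned ulen)
{
    return b1 == bare_lead(ulen) && b2 < min_b2[ulen];
}

/* Mask form: none of the top ulen - 2 value bits of b2 is set. */
static int overlong_by_mask(unsigned b1, unsigned b2, unsigned ulen)
{
    return b1 == bare_lead(ulen) && ((b2 & 0x3f) >> (8 - ulen)) == 0;
}
```

Both functions agree for every continuation byte from 0x80 to 0xBF,
for example on e0 9f (overlong) and e0 a0 (not overlong).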

Sorry for the noise.

--
Ben.

Re: UTF-8 overlong encodings

<871r27hsvw.fsf@doppelsaurus.mobileactivedefense.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=16975&group=comp.unix.programmer#16975

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: rweikusat@talktalk.net (Rainer Weikusat)
Newsgroups: comp.unix.programmer
Subject: Re: UTF-8 overlong encodings
Date: Mon, 20 Dec 2021 21:24:19 +0000
Lines: 28
Message-ID: <871r27hsvw.fsf@doppelsaurus.mobileactivedefense.com>
References: <874k76zvrc.fsf@doppelsaurus.mobileactivedefense.com>
<87y24ird6y.fsf@bsb.me.uk>
<87mtkx6di7.fsf@doppelsaurus.mobileactivedefense.com>
<87sfuppku0.fsf@bsb.me.uk>
<87tuf446mh.fsf@doppelsaurus.mobileactivedefense.com>
<87ee68lyhn.fsf@bsb.me.uk>
<87a6gvi10g.fsf@doppelsaurus.mobileactivedefense.com>
<8735mnm2ci.fsf@bsb.me.uk>
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: individual.net LepXkfDyVZf5Nh/r6DiZ6gYgkT46U3ZL24L3eC20V272LGvjI=
Cancel-Lock: sha1:n+E3GeNrE1LUP1Ovp2r82hwbiIg= sha1:S1B34UT+ZPA11uJzrcFPpol773w=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)

by: Rainer Weikusat - Mon, 20 Dec 2021 21:24 UTC

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Rainer Weikusat <rweikusat@talktalk.net> writes:
>> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> <cut>
>>> I don't see why you need to look at the next byte.
>>>
>>>> if (!x) { /* x == 0 implies u_len > 2 */
>>>
>>> x == 0 implies an overlong sequence now that you have dealt with the
>>> length 2 case which can have one bit on x set and still be overlong.
>>
>> According to the Linux man page, a number in the range 0x800 - 0xffff is
>> encoded as three bytes:

[...]

> In terms of bit masks,
>
> (b2 & 0x3f) >> 8 - ulen

I had the same idea myself in the meantime: the expressions become
simpler when shifting out the unwanted bits instead of selecting the
wanted ones via &-masking. The one above is for the second byte, the
one for the first would be

b1 << u_len

(truncated to 8 bits) with the special case that, when the length of the
sequence is 2, a result of 4 also means it's overlong.
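
Put together, the two shift expressions could be used roughly like
this. This is a sketch with a hypothetical function name, assuming b1
and b2 have valid prefixes and u_len (2..6) matches the prefix of b1.
Because C promotes b1 to int before shifting, the result has to be cut
back to 8 bits explicitly.

```c
/* Returns 1 if the sequence starting with b1 b2 is overlong. */
static int overlong_by_shift(unsigned b1, unsigned b2, unsigned u_len)
{
    /* shifts out the u_len 1 bits; the 0 bit of the prefix ends up on top */
    unsigned v1 = (b1 << u_len) & 0xff;

    if (u_len == 2)
        return v1 <= 4;     /* value bits 00000 or 00001: below 0x80 */
    if (v1 != 0)
        return 0;           /* a value bit in b1 is set: long enough */

    /* b1 carries no value bits: look at the top bits of b2 */
    return ((b2 & 0x3f) >> (8 - u_len)) == 0;
}
```

For length 2 the shifted result is the five value bits times 4, so 0
and 4 correspond to the lead bytes 0xc0 and 0xc1.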
