devel / comp.lang.mumps / Adventures with UTF-8 performance

Adventures with UTF-8 performance

 by: Jens Lideström - Tue, 1 Nov 2022 09:31 UTC

I want to describe our problems and solutions with UTF-8 performance. Maybe this will be useful for someone else in the community.

### The problem

We started experiencing performance problems when large messages were sent to and from our M application. Our application builds and parses strings containing the full JSON data of network messages.

### Incorrect theory

At first we thought the problem was the string building itself. Our code concatenates strings in the conventional M manner:

set result=result_stringForTheCurrentIteration

We thought this resulted in O(n^2) execution time in the length of the string.

### Discoveries

However, experiments made us realise that the string building code was in fact not the problem. We noted that our code was only slow when the input messages contained non-ASCII characters!

We also learned more about M performance from this extremely interesting post by Bhaskar:

https://groups.google.com/g/comp.lang.mumps/c/MSVKLt0X6R4/m/zqBx52MTAgAJ

### Correct diagnosis

Instead, we figured out that it is the string manipulation routines in M that are slow for large non-ASCII strings.

The following code will have O(n) performance in the length of the string (and the value of `index`):

s ch=$extract(largeString,index)

This is not surprising. Characters in a UTF-8 encoded string have a variable byte length: ASCII characters are 1 byte, other characters consist of 2 to 4 bytes. To find the character at a particular index, the implementation of `$extract` has to start from the beginning and traverse all the preceding characters to find the byte position of the one that it should return.
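To make the linear scan concrete, here is a minimal sketch in Python (not M, and not GT.M's actual implementation). The function name `byte_offset_of_char` is hypothetical, and the code assumes the input is valid UTF-8.

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Return the 0-based byte offset of the character at 1-based position `index`.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    The loop runs index-1 times, so the cost grows linearly with `index`.
    """
    pos = 0
    for _ in range(index - 1):
        if pos >= len(data):
            raise IndexError("index is past the end of the string")
        lead = data[pos]
        # The lead byte tells us how many bytes the character occupies.
        if lead < 0x80:
            pos += 1      # ASCII
        elif lead < 0xE0:
            pos += 2      # 2-byte sequence
        elif lead < 0xF0:
            pos += 3      # 3-byte sequence
        else:
            pos += 4      # 4-byte sequence
    if pos >= len(data):
        raise IndexError("index is past the end of the string")
    return pos


text = "aåb€c".encode("utf-8")  # byte lengths: a=1, å=2, b=1, €=3, c=1
offset = byte_offset_of_char(text, 5)  # walks past four characters to reach "c"
```

Calling this once per character in a loop over the whole string gives the quadratic behaviour we observed for non-ASCII input.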

GT.M seems to have an optimisation so that if a string consists of only ASCII characters then `$extract` can fetch characters in O(1) time.

### Solution

Our solution is to switch from `$extract` and `$length` to `$zextract` and `$zlength` for string manipulation. The Z variants ignore UTF-8 and treat strings as sequences of bytes. Because of that, `$zextract` can work in O(1) time in all cases.

The complication with this solution is that we have to write code to handle multi-byte characters ourselves. Fortunately this turned out to be pretty simple in our case.

All bytes of multi-byte UTF-8 characters have a value that is 128 or larger, while a 1-byte character has a value that is 127 or smaller. Because of this it is easy to distinguish them.

Have a look at the Wikipedia page for an explanation: https://en.wikipedia.org/wiki/UTF-8#Encoding

In our case we have to examine the 1-byte characters to generate correct JSON. The multi-byte characters, however, can simply be copied byte-by-byte to the output.
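The byte-wise approach can be sketched roughly as follows. This is Python rather than M and is not our production code: the name `json_escape_utf8` and the escape table are assumptions made for illustration. It relies on the JSON rule that only the quotation mark, the backslash and the control characters below 0x20 must be escaped inside a string.

```python
# Short escape sequences for ASCII bytes that must not appear raw in a JSON string.
_SHORT_ESCAPES = {
    0x22: b'\\"',   # quotation mark
    0x5C: b"\\\\",  # backslash
    0x08: b"\\b",
    0x0C: b"\\f",
    0x0A: b"\\n",
    0x0D: b"\\r",
    0x09: b"\\t",
}


def json_escape_utf8(data: bytes) -> bytes:
    """Escape the contents of a JSON string, working on raw UTF-8 bytes.

    Hypothetical helper; assumes `data` is valid UTF-8. Each byte is
    handled in constant time, so the whole pass is O(n).
    """
    out = bytearray()
    for b in data:
        if b >= 0x80:
            out.append(b)                 # part of a multi-byte character: copy as-is
        elif b in _SHORT_ESCAPES:
            out += _SHORT_ESCAPES[b]      # ASCII byte with a short escape
        elif b < 0x20:
            out += b"\\u%04x" % b         # other control characters
        else:
            out.append(b)                 # ordinary ASCII
    return bytes(out)


escaped = json_escape_utf8('Å "x"\t'.encode("utf-8"))
```

No byte of a multi-byte sequence can be mistaken for a quote or a backslash, because all of those bytes are 128 or larger.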

In this way we have obtained an O(n) execution time for our JSON generation and parsing routines.

Re: Adventures with UTF-8 performance

 by: Onix Man - Fri, 4 Nov 2022 05:43 UTC


It would be faster if the internal strings used UCS-2 encoding (char16_t). In that case there would be no need for duplicate $length and $z functions. But gtm/db goes its own way.

Re: Adventures with UTF-8 performance

 by: Jens Lideström - Fri, 4 Nov 2022 08:00 UTC

On Friday, November 4, 2022 at 6:43:17 AM UTC+1, a1443...@gmail.com wrote:
> It would be faster if the internal strings used UCS-2 encoding (char16_t). In that case there would be no need for duplicate $length and $z functions. But gtm/db goes its own way.

I don't think it would be faster.

With UTF-16 you still have to deal with surrogate pairs to represent all of Unicode.

https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
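As a rough illustration of the surrogate-pair arithmetic described on that page, here is a small Python sketch. The function name `utf16_code_units` is hypothetical, and the code assumes its argument is a valid Unicode scalar value (not itself in the surrogate range).

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (16-bit integers) for one code point.

    Hypothetical helper; assumes a valid scalar value, that is, a code
    point up to 0x10FFFF that is not in the range 0xD800-0xDFFF.
    """
    if code_point < 0x10000:
        return [code_point]            # fits in a single 16-bit unit
    offset = code_point - 0x10000      # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


units_bmp = utf16_code_units(0x00E5)      # "å": one unit
units_astral = utf16_code_units(0x1F600)  # an emoji: two units
```

So a fixed two bytes per character only holds for code points below U+10000.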

Re: Adventures with UTF-8 performance

 by: Alex Maslov - Tue, 8 Nov 2022 16:23 UTC

On Friday, 4 November 2022 at 11:02:50 UTC+3, jens.li...@vgregion.se wrote:
> I don't think it would be faster.
>
> With UTF-16 you still have to deal with surrogate pairs to represent all of Unicode.

Surrogate pairs are rare guests, in European languages at least.
Having some experience in both worlds (ISC and GT.M), I can confirm that ISC's UCS-2-based approach incurs much less overhead for Unicode support than GT.M's.

Re: Adventures with UTF-8 performance

 by: Jens Lideström - Wed, 16 Nov 2022 12:21 UTC

It's interesting to hear about your experience, Alex!

I guess it comes down to whether $length and friends must give the correct result in the presence of surrogate pairs, or whether it is acceptable to assume that every character is 2 bytes.

Java, for example, uses UTF-16 in its string interface and might give weird results for surrogate pairs.
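The kind of discrepancy I mean can be sketched in Python (not Java or M). The helper names are hypothetical; `utf16_length` counts 16-bit code units, which is what Java's `String.length()` reports, while `code_point_length` counts whole characters.

```python
def utf16_length(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def code_point_length(text: str) -> int:
    """Number of Unicode code points (Python's len() already counts these)."""
    return len(text)


sample = "a\U0001F600"                 # "a" followed by one emoji outside the BMP
units = utf16_length(sample)           # the emoji contributes two code units
points = code_point_length(sample)     # the emoji contributes one code point
```

For text that stays within the Basic Multilingual Plane the two counts agree, which is why the simplification usually goes unnoticed.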
