Re: Lexing Unicode strings?

<21-07-002@comp.compilers>

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: haberg-news@telia.com (Hans Aberg)
Newsgroups: comp.compilers
Subject: Re: Lexing Unicode strings?
Date: Wed, 14 Jul 2021 15:39:25 -0400 (EDT)
Organization: A noiseless patient Spider
Lines: 21
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <21-07-002@comp.compilers>
References: <21-05-001@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="96033"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, i18n
Posted-Date: 14 Jul 2021 15:39:25 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <21-05-001@comp.compilers>
Content-Language: en-US
 by: Hans Aberg - Wed, 14 Jul 2021 19:39 UTC

On 2021-05-04 01:58, John Levine wrote:
> [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
> is a prefix or suffix of any other UTF-8 encoding, you can lex them
> the same way you'd lex strings of ASCII. In that example above, \xCE,
> \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
> a multi-byte sequence, so if they do, you can put a wildcard . at the
> end to match bogus bytes and complain about an invalid character. Dunno
> what you mean about not always UTF-8; I realize there are mislabeled
> files of UTF-16 that you have to sort out by sniffing the BOM at the
> front, but you do that and turn whatever you're getting into UTF-8
> and then feed it to the lexer.
>
> I agree that lexing Unicode is not a solved problem, and I'm not
> aware of any really good ways to limit the table sizes. -John]

I wrote code, in Haskell and C++, that translates Unicode character
classes into byte classes. From a theoretical standpoint, the image of a
Unicode regular language under the UTF-8 encoding is a regular language
over bytes, so this is always possible. Hence the 2^8 = 256 entry tables
that Flex uses are enough. The Flex manual has an example of how to
write a regular expression that replaces its dot '.' so that it picks up
all legal UTF-8 byte sequences.
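
To illustrate the kind of translation described above, here is a minimal
Python sketch. It is not the Haskell or C++ code mentioned, only one
common way to do it; the names utf8_sequences and matches are
hypothetical. It splits a range of code points into alternatives, each a
sequence of byte ranges, so a byte-oriented lexer can match the class.

```python
def utf8_sequences(lo, hi):
    """Split the code point range lo..hi (inclusive) into alternatives.

    Each alternative is a list of (low_byte, high_byte) pairs, one pair
    per byte of the UTF-8 encoding.
    """
    if lo > hi:
        return []
    # Surrogates U+D800..U+DFFF have no UTF-8 encoding: cut them out.
    if lo <= 0xDFFF and hi >= 0xD800:
        return utf8_sequences(lo, 0xD7FF) + utf8_sequences(0xE000, hi)
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            return utf8_sequences(lo, limit) + utf8_sequences(limit + 1, hi)
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Split until every continuation byte position that varies
    # covers the whole span of its 6 payload bits.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if lo & ~mask != hi & ~mask:
            if lo & mask != 0:
                return (utf8_sequences(lo, lo | mask)
                        + utf8_sequences((lo | mask) + 1, hi))
            if hi & mask != mask:
                return (utf8_sequences(lo, (hi & ~mask) - 1)
                        + utf8_sequences(hi & ~mask, hi))
    # Now the range is "rectangular": pair up the bytes of both ends.
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    return [list(zip(first, last))]


def matches(seqs, data):
    """True if the bytes in data match one of the alternatives exactly."""
    return any(
        len(seq) == len(data)
        and all(low <= b <= high for (low, high), b in zip(seq, data))
        for seq in seqs
    )


# Greek alpha..omega, the range from the quoted example:
greek = utf8_sequences(0x3B1, 0x3C9)
# greek == [[(0xCE, 0xCE), (0xB1, 0xBF)], [(0xCF, 0xCF), (0x80, 0x89)]]
# i.e. as a byte pattern: \xCE[\xB1-\xBF]|\xCF[\x80-\x89]
```

Applied to the full range 0..0x10FFFF, the same function yields the
alternatives for all legal UTF-8 sequences, which is what a replacement
for dot needs.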

