Rocksolid Light

Welcome to RetroBBS

mail  files  register  newsreader  groups  login

Message-ID:  

Crazee Edeee, his prices are INSANE!!!


devel / comp.compilers / Re: Interesting paper on regex NFA matching

SubjectAuthor
* Re: Interesting paper on regex NFA matchingChristopher F Clark
`- Re: Interesting paper on regex NFA matchingKaz Kylheku

1
Re: Interesting paper on regex NFA matching

<24-01-006@comp.compilers>

  copy mid

https://www.rocksolidbbs.com/devel/article-flat.php?id=880&group=comp.compilers#880

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christopher.f.clark@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: Re: Interesting paper on regex NFA matching
Date: Sat, 27 Jan 2024 12:47:33 +0200
Organization: Compilers Central
Sender: johnl%iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <24-01-006@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="71672"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, NFA
Posted-Date: 27 Jan 2024 12:52:49 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Sat, 27 Jan 2024 10:47 UTC

Kaz Kylheku <433-929-6894@kylheku.com> wrote:

> (I personaly am not so interested in backtracking regexes; and they are
> not directly applicable to compilers. You'd never want to design a
> lexical analyzer that requires "lookaround" or whatnot in order to
> recognize tokens. It's nice they are generating Arxiv papers for
> someone; no genuine intellectual inquiry is invalid; good luck! It's
> not applicable to the "Programming Languages" category it has been filed
> under. I will skim through it, though.)

You are mistaken that they are not applicable to programming languages.

Even Pascal [originally] had cases where lookahead and backtracking
are required to properly tokenize the input.

The well-known case is "1." if followed by another "." it is two tokens,
and if followed by a digit it is a floating point number.
I don't recall what it should be if followed by any other character.

In any case, parser generators like ANTLR use backtracking to great
effect (and yes, we "borrowed" that idea in Yacc++) and so did the
entire PEG (parsing expression grammar) community, using memoizartion
to give linear time performance in many cases.

Thus, to me, it is no surprise to see this reintegrated back into the
PCRE world.

Moreover, even if it is of little interest [mistakenly in my opinion] to the
"programming language" community, it is of definite interest in the
"real world:" PCRE regular expressions are both used (and abused)
in many applications, including SNORT where linear performance and
preventing DDOS attacks would be of prime importance. (While at Intel,
I had the honor of working with the Hyperscan (Sensory Networks) team
and improving PCRE performance for SNORT was a major "selling point"

And "greedy" (longest match), "non-greedy" (shortest match) and other
variations like "lexically first in the grammar" matching all have their
applications. Saying that only one of them is correct is ignoring the
needs of actual users.

And, yes, we had discussions about "automata theory correct" interpretations
of what certain PCRE expressions meant and how even "canonical"
implementations had various "bugs" that made their meaning ambiguous.

By the way, to give credit where due, we "borrowed" the idea of
"tunnel automatons" from Josef Grosch to implement controlled
backtracking in Yacc++. It lets one do subset construction (i.e.
the LR(0) algorithm) for items, breaking out for "syntactic predicates"
exactly as needed.

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Re: Interesting paper on regex NFA matching

<24-01-007@comp.compilers>

  copy mid

https://www.rocksolidbbs.com/devel/article-flat.php?id=881&group=comp.compilers#881

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: 433-929-6894@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: Interesting paper on regex NFA matching
Date: Sat, 27 Jan 2024 20:20:29 -0000
Organization: Compilers Central
Sender: johnl%iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <24-01-007@comp.compilers>
References: <24-01-006@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="81627"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 27 Jan 2024 22:19:04 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Kaz Kylheku - Sat, 27 Jan 2024 20:20 UTC

On 2024-01-27, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
> Kaz Kylheku <433-929-6894@kylheku.com> wrote:
>
>> (I personaly am not so interested in backtracking regexes; and they are
>> not directly applicable to compilers. You'd never want to design a
>> lexical analyzer that requires "lookaround" or whatnot in order to
>> recognize tokens. It's nice they are generating Arxiv papers for
>> someone; no genuine intellectual inquiry is invalid; good luck! It's
>> not applicable to the "Programming Languages" category it has been filed
>> under. I will skim through it, though.)
>
> You are mistaken that they are not applicable to programming languages.
>
> Even Pascal [originally] had cases where lookahead and backtracking
> are required to properly tokenize the input.
>
> The well-known case is "1." if followed by another "." it is two tokens,
> and if followed by a digit it is a floating point number.
> I don't recall what it should be if followed by any other character.

I think that case can be handled by technology like Lex trailing
contexts, which doesn't require backtracking.

{digits}/. {
// ... set up semantic value ..
return INTEGER;
}

{digits}.{digits} {
// ... set up semantic value ..
return FLOAT;
}

Another possibility is to recongize {digits}. as one lexeme and call
unput('.'). That's a minor form of backtracking, that doesn't require us
to do anything funny to the regex implementation.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[Patterns with alternatives and trailing contexts do have to back up
and the flex manual warns it's quite expensive. See the
PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]

1
server_pubkey.txt

rocksolid light 0.9.81
clearnet tor