devel / comp.lang.tcl / Re: possible bug in "regexp" syntax ".*?"

Subject                                             Author
* possible bug in "regexp" syntax ".*?"             aotto1968
+- Re: possible bug in "regexp" syntax ".*?"        aotto1968
`* Re: possible bug in "regexp" syntax ".*?"        Christian Gollwitzer
 +* Re: possible bug in "regexp" syntax ".*?"       heinrichmartin
 |`- Re: possible bug in "regexp" syntax ".*?"      Rich
 `- Re: possible bug in "regexp" syntax ".*?"       Oleg Nemanov

possible bug in "regexp" syntax ".*?"

<t9aalb$61p$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=19580&group=comp.lang.tcl#19580
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: aotto1968@t-online.de (aotto1968)
Newsgroups: comp.lang.tcl
Subject: possible bug in "regexp" syntax ".*?"
Date: Sun, 26 Jun 2022 21:05:14 +0200
Organization: A noiseless patient Spider
Lines: 82
Message-ID: <t9aalb$61p$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 26 Jun 2022 19:05:15 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="bfe38a620c8ff2633dc55c0f66ff89df";
logging-data="6201"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/cpSTv6+owrZ6uNxC6qk+JZGo/hm4i12g="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.10.0
Cancel-Lock: sha1:jFudtsDoW0U0DmQNHdysZw84+T4=
Content-Language: en-US
 by: aotto1968 - Sun, 26 Jun 2022 19:05 UTC

Hi,

my code

# @cartouche
while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end smallexample[^@]+@end cartouche} $txt allI valI]} {
  set val [string trim [string range $txt {*}$valI]]
  set val [string map { \\@\{ @\{ } $val ]
  set txt [string replace $txt {*}$allI "~~~~~\n$val\n~~~~~" ]
}

with data from:

@cartouche
@smallexample
# file: quote.cfg
quote = "Criticism may not be agreeable, but it is necessary."
" It fulfils the same function as pain in the human"
" body. It calls attention to an unhealthy state of"
" things.\n"
"\t--Winston Churchill";
@end smallexample
@end cartouche

@cartouche
@smallexample
# file: test.cfg
info: @{
name = "Winston Churchill";
@@include "quote.cfg"
country = "UK";
@};
@end smallexample
@end cartouche

creates the *false* match:

~~~~~
# file: quote.cfg
quote = "Criticism may not be agreeable, but it is necessary."
" It fulfils the same function as pain in the human"
" body. It calls attention to an unhealthy state of"
" things.\n"
"\t--Winston Churchill";
@end smallexample
@end cartouche

@cartouche
@smallexample
# file: test.cfg
info: @{
name = "Winston Churchill";
@@include "quote.cfg"
country = "UK";
@};
~~~~~

because of an error in ".*?", which should find the *smallest* match.
The *good* result would be:

~~~~~
# file: quote.cfg
quote = "Criticism may not be agreeable, but it is necessary."
" It fulfils the same function as pain in the human"
" body. It calls attention to an unhealthy state of"
" things.\n"
"\t--Winston Churchill";
~~~~~

~~~~~
# file: test.cfg
info: @{
name = "Winston Churchill";
@@include "quote.cfg"
country = "UK";
@};
~~~~~

example online: https://regex101.com/r/AWkpiL/1

Regards

Re: possible bug in "regexp" syntax ".*?"

<t9abg3$9b2$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=19581&group=comp.lang.tcl#19581
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: aotto1968@t-online.de (aotto1968)
Newsgroups: comp.lang.tcl
Subject: Re: possible bug in "regexp" syntax ".*?"
Date: Sun, 26 Jun 2022 21:19:31 +0200
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <t9abg3$9b2$1@dont-email.me>
References: <t9aalb$61p$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 26 Jun 2022 19:19:31 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="bfe38a620c8ff2633dc55c0f66ff89df";
logging-data="9570"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18+i0RFY+zhH74VCIKrWsHvbhUfUueCFe0="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.10.0
Cancel-Lock: sha1:ptU8+L79f1tA/7Bgk31WgLY2g4M=
In-Reply-To: <t9aalb$61p$1@dont-email.me>
Content-Language: en-US
 by: aotto1968 - Sun, 26 Jun 2022 19:19 UTC

Just as info, the following code does the job (splitting by hand):

# @cartouche
while {[regexp -indices {@cartouche[^@]+@smallexample} $txt allI]} {
  lassign $allI start1I start2I
  if {![regexp -indices -start $start2I {@end\s+smallexample[^@]+@end\s+cartouche} $txt allI]} {
    error "don't find end of '@cartouche'"
  }
  lassign $allI end1I end2I
  set val [string trim [string range $txt $start2I+1 $end1I-1]]
  set val [string map { \\@\{ @\{ } $val ]
  set txt [string replace $txt $start1I $end2I "~~~~~\n$val\n~~~~~" ]
}

Regards

Re: possible bug in "regexp" syntax ".*?"

<t9acn7$dtm$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=19582&group=comp.lang.tcl#19582
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: auriocus@gmx.de (Christian Gollwitzer)
Newsgroups: comp.lang.tcl
Subject: Re: possible bug in "regexp" syntax ".*?"
Date: Sun, 26 Jun 2022 21:40:22 +0200
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <t9acn7$dtm$1@dont-email.me>
References: <t9aalb$61p$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 26 Jun 2022 19:40:23 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="975c3cac725442a59a756fd488095c49";
logging-data="14262"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19idoBwJXbOjOjtKG5Y1SVz2AyfY5PcJig="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:91.0)
Gecko/20100101 Thunderbird/91.10.0
Cancel-Lock: sha1:bLLINh6CjTaud+oqIFfWPdhtCsA=
In-Reply-To: <t9aalb$61p$1@dont-email.me>
 by: Christian Gollwitzer - Sun, 26 Jun 2022 19:40 UTC

On 26.06.22 at 21:05, aotto1968 wrote:
> Hi,
>
> my code
>
>   # @cartouche
>   while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end smallexample[^@]+@end cartouche} $txt allI valI]} {

I haven't checked the details, but your RE mixes greedy (+) and
non-greedy (.*?) quantifiers. AFAIK, this can lead to hard-to-understand
behaviour. Quote from the manpage:
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.html
"The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner, but do not work as hard at
guessing the user's real intentions.) "

The page you used to test the RE uses PCRE, the most widely known RE
engine, whereas Tcl uses Henry Spencer's RE engine. In certain corner
cases (esp. for maliciously crafted REs), the algorithm used by PCRE can
be exponentially slow, whereas Spencer's engine is polynomial. On
average, however, PCRE is faster for non-malicious REs.

One could argue that Tcl 9 should migrate to PCRE and drop Spencer's RE,
which is also badly maintained. Someone with more moment than me could
write a TIP about it ;)

Christian

Re: possible bug in "regexp" syntax ".*?"

<314ebf87-3bc7-4ff3-bc1b-58a7f79f2f59n@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=19583&group=comp.lang.tcl#19583
X-Received: by 2002:a05:6000:695:b0:21a:3a1a:7b60 with SMTP id bo21-20020a056000069500b0021a3a1a7b60mr10232374wrb.441.1656309306277;
Sun, 26 Jun 2022 22:55:06 -0700 (PDT)
X-Received: by 2002:a9d:32f:0:b0:616:aa29:291d with SMTP id
44-20020a9d032f000000b00616aa29291dmr5161092otv.312.1656309305589; Sun, 26
Jun 2022 22:55:05 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.mixmin.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.tcl
Date: Sun, 26 Jun 2022 22:55:05 -0700 (PDT)
In-Reply-To: <t9acn7$dtm$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=84.115.233.51; posting-account=Od2xOAoAAACEyRX3Iu5rYt4oevuoeYUG
NNTP-Posting-Host: 84.115.233.51
References: <t9aalb$61p$1@dont-email.me> <t9acn7$dtm$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <314ebf87-3bc7-4ff3-bc1b-58a7f79f2f59n@googlegroups.com>
Subject: Re: possible bug in "regexp" syntax ".*?"
From: martin.heinrich@frequentis.com (heinrichmartin)
Injection-Date: Mon, 27 Jun 2022 05:55:06 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: heinrichmartin - Mon, 27 Jun 2022 05:55 UTC

On Sunday, June 26, 2022 at 9:40:27 PM UTC+2, Christian Gollwitzer wrote:
> On 26.06.22 at 21:05, aotto1968 wrote:
> > Hi,
> >
> > my code
> >
> > # @cartouche
> > while {[regexp -indices {@cartouche[^@]+@smallexample(.*?)@end smallexample[^@]+@end cartouche} $txt allI valI]} {
> I haven't checked the details, but your RE mixes greedy (+) and
> non-greedy (.*?) quantifiers. AFAIK, this can lead to hard to understand
> behaviour. Quote from the manpage:
> https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.html
> "The matching rules for REs containing both normal and non-greedy
> quantifiers have changed since early beta-test versions of this package.
> (The new rules are much simpler and cleaner, but do not work as hard at
> guessing the user's real intentions.) "

You are looking for the description of the _preference_ of an RE. Take-away message: you cannot mix greedy and non-greedy matching, and the *first* quantifier determines the preference of the full RE.
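
A minimal sketch of that rule, assuming Tcl 8.6 and a shortened stand-in for the Texinfo input from the first post (the sample text and the variable names mixed, lazy, mixedVal and lazyVal are made up for illustration). Making every quantifier non-greedy turns the whole RE non-greedy, so the match should stop at the first closing pair:

```tcl
# Shortened stand-in for the Texinfo input: two cartouche blocks.
set txt [join {
    @cartouche @smallexample first {@end smallexample} {@end cartouche}
    {}
    @cartouche @smallexample second {@end smallexample} {@end cartouche}
} \n]

# The first quantifier ([^@]+) is greedy, so the RE as a whole prefers
# the longest match, even though (.*?) is written as non-greedy.
set mixed {@cartouche[^@]+@smallexample(.*?)@end smallexample[^@]+@end cartouche}

# Every quantifier is non-greedy, so the RE as a whole prefers the
# shortest match.
set lazy {@cartouche[^@]+?@smallexample(.*?)@end smallexample[^@]+?@end cartouche}

regexp $mixed $txt -> mixedVal
regexp $lazy  $txt -> lazyVal

puts "mixed capture reaches 'second': [string match *second* $mixedVal]"
puts "lazy capture: [string trim $lazyVal]"
```

With Tcl 8.6 this should print 1 for the mixed RE (its capture runs on into the second block) and "first" for the all-non-greedy one. `[^@]+?` still has to consume everything up to the next `@`, so switching it to non-greedy changes only the overall preference, not what that atom can match.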

I often fall back to _not_ creating the super-duper long regexp that matches the whole thing from a stream (I use Expect a lot). Instead, I match the header first (anchored at the start of the buffer, discarding junk), then find the matching footer; i.e. when magic[1] does not help, go the extra mile and implement that state machine.
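
One way to read the header-then-footer idea is a small line-based state machine. The sketch below is an illustration only, not code from this thread: the proc name convertCartouche and the state names are hypothetical, it assumes each @cartouche / @smallexample marker sits on its own line, and it skips the `string map` unescaping step from the original loop.

```tcl
# Hypothetical helper: rewrite @cartouche/@smallexample blocks into
# "~~~~~" delimited blocks, one input line at a time.
proc convertCartouche {txt} {
    set out {}
    set state text
    foreach line [split $txt \n] {
        set t [string trim $line]
        switch -- $state {
            text {
                # Outside any block: copy lines until a header shows up.
                if {$t eq "@cartouche"} {
                    set state cartouche
                } else {
                    lappend out $line
                }
            }
            cartouche {
                # Header seen: drop lines until the example starts.
                if {$t eq "@smallexample"} {
                    lappend out ~~~~~
                    set state example
                }
            }
            example {
                # Inside the example: copy lines until its footer.
                if {$t eq "@end smallexample"} {
                    lappend out ~~~~~
                    set state tail
                } else {
                    lappend out $line
                }
            }
            tail {
                # Example closed: drop lines until the cartouche ends.
                if {$t eq "@end cartouche"} {
                    set state text
                }
            }
        }
    }
    return [join $out \n]
}
```

Calling `puts [convertCartouche $txt]` on the input from the first post should emit each example between its own pair of "~~~~~" lines, because a footer can only ever close the block that is currently open.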

Having said that, you could also look into packages that facilitate parsing (lexers). There used to be fickle/fcl (which was used by tcldoc); I am writing in the past tense because a quick web search did not show their current homes.

[1] https://xkcd.com/208/

Re: possible bug in "regexp" syntax ".*?"

<t9cchq$h9m5$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=19584&group=comp.lang.tcl#19584
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: rich@example.invalid (Rich)
Newsgroups: comp.lang.tcl
Subject: Re: possible bug in "regexp" syntax ".*?"
Date: Mon, 27 Jun 2022 13:49:46 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <t9cchq$h9m5$1@dont-email.me>
References: <t9aalb$61p$1@dont-email.me> <t9acn7$dtm$1@dont-email.me> <314ebf87-3bc7-4ff3-bc1b-58a7f79f2f59n@googlegroups.com>
Injection-Date: Mon, 27 Jun 2022 13:49:46 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="08355d0d58abe75e168ccb6b112d5791";
logging-data="566981"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19/X9j+7wdR9/uYrhIOu1M/"
User-Agent: tin/2.0.1-20111224 ("Achenvoir") (UNIX) (Linux/3.10.17 (x86_64))
Cancel-Lock: sha1:C7NMeoloBftzoz+FTKwhyzzx/Cs=
 by: Rich - Mon, 27 Jun 2022 13:49 UTC

heinrichmartin <martin.heinrich@frequentis.com> wrote:
> Having that said, you could also look into packages that facilitate
> parsing (lexers). There used to be fickle/fcl (that was used by
> tcldoc; I am writing in past-tense, because a quick web search did
> not show their current homes).
>
> [1] https://xkcd.com/208/

Tcllib's PEG parser generator is relatively easy to use, once one
grasps the syntax.

http://tmml.sourceforge.net/doc/tcllib/peg.html

There was also a nice example posted here to the group some months back
for a PEG grammar. I forget now who posted it.

Re: possible bug in "regexp" syntax ".*?"

<fd9b91f8-0d06-474a-924a-262b00938b56n@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=19666&group=comp.lang.tcl#19666
X-Received: by 2002:a05:620a:450c:b0:6b2:59b8:985 with SMTP id t12-20020a05620a450c00b006b259b80985mr10924311qkp.328.1657543448973;
Mon, 11 Jul 2022 05:44:08 -0700 (PDT)
X-Received: by 2002:a05:6830:18f4:b0:61c:341b:16df with SMTP id
d20-20020a05683018f400b0061c341b16dfmr5804970otf.46.1657543448720; Mon, 11
Jul 2022 05:44:08 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.tcl
Date: Mon, 11 Jul 2022 05:44:08 -0700 (PDT)
In-Reply-To: <t9acn7$dtm$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=194.190.114.28; posting-account=RPJNegoAAAAUgD_yLdrci9D1ZtZ1oI0L
NNTP-Posting-Host: 194.190.114.28
References: <t9aalb$61p$1@dont-email.me> <t9acn7$dtm$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fd9b91f8-0d06-474a-924a-262b00938b56n@googlegroups.com>
Subject: Re: possible bug in "regexp" syntax ".*?"
From: oleg.o.nemanov@gmail.com (Oleg Nemanov)
Injection-Date: Mon, 11 Jul 2022 12:44:08 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 1631
 by: Oleg Nemanov - Mon, 11 Jul 2022 12:44 UTC

On Sunday, 26 June 2022 at 22:40:27 UTC+3, Christian Gollwitzer wrote:
> One could argue that Tcl 9 should migrate to PCRE and drop Spencer's RE,
> which is also badly maintained. Someone with more moment than me could
> write a TIP about it ;)

No, no. Let's keep the RE engine as is :-). If someone wants to use PCRE, they can
already do so with an external library.
