RetroBBS - comp.lang.awk - Re: Gawk IGNORECASE=0 vs =1

Gawk IGNORECASE=0 vs =1

<a752da8c-86ec-4c9d-aefc-895100efdd29n@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1119&group=comp.lang.awk#1119


by: J Naman - Wed, 23 Feb 2022 21:34 UTC

Benchmarks have been mentioned recently in some of these posts. I recently benchmarked IGNORECASE=0 vs =1 under GAWK Ver 5.1.1.
The result is no surprise: a 128% difference for this one benchmark.
I am just reporting the quantitative difference. The wall clock time difference
can be non-trivial for processing large files having hundreds of
thousands of lines of text.
The benchmark was for a set of six statements:
four of the form:
if(str~/^text/) {return;}
plus two statements:
if(str~/^[A-Z]+$/) {return;}
if(str~/^[a-z]+$/) {return;}
The results over an aggregate 3 million loops:
Score  Test
817    Avg IGN=1
640    Avg IGN=0
817 is about 128% of 640, i.e. IGN=1 scored roughly 28% higher

Also, I reran the same test with "? at the beginning
and end of all six regexps. The scores were
not significantly different than the above.

* These Scores are scaled. I was previously warned
not to report actual CPU or clock times for one particular system.

Re: Gawk IGNORECASE=0 vs =1

<sv6agm$8lf$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1121&group=comp.lang.awk#1121

by: Ed Morton - Wed, 23 Feb 2022 21:55 UTC

Were there 128% more matches or some other difference in the matched
strings? Without knowing what the input contained it's hard to know what
those results mean. What were you hoping to test by adding `"?` to the
regexps? Without knowing how IGN=1 compares to the alternative of
`tolower(str) ~ /^[a-z]+$/` I'm not sure what we could actually do with
this information.

Ed.

Re: Gawk IGNORECASE=0 vs =1

<a4def703-6670-4f74-810f-4b285240035cn@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1122&group=comp.lang.awk#1122

by: J Naman - Thu, 24 Feb 2022 04:45 UTC

Here are six results, scaled (not surprising to me):
       IC=0   IC=1   1/0
low    108    226    109% longer
Mix    100    228    128% longer
UP     111    225    102% longer

low = "include variable function namespace x"
Mix = "Include Variable Function NameSpace x"
UP = "INCLUDE VARIABLE FUNCTION NAMESPACE X"

function testmatch(str,    x) {    # all 7 regexps are tested on every call
    if (str ~ /^include variable function namespace x/) {x++}    # lower
    if (str ~ /^INCLUDE VARIABLE FUNCTION NAMESPACE X/) {x++}    # upper
    if (str ~ /^Include Variable Function NameSpace x/) {x++}    # mixed
    if (str ~ /^iNCLUDE vARIABLE fUNCTION nAMEsPACE X/) {x++}    # inverted case
    if (str ~ /^INclUde varIABLE FUnCtIon NamEspACe X/) {x++}    # random case
    if (str ~ /^[A-Z ]+$/) {x++}
    if (str ~ /^[a-z ]+$/) {x++}
} # end of function testmatch(str)

So, worst case, IGNORECASE=1 takes about twice as long. No surprise.
I forced all 7 regexps to be tested on every call because my real data doesn't match very often.
All of my regexps are mixed case, and the file data are supposed to be too.
tolower() on both input and regexp looks to be no better than
mixed-case input matched against a mixed-case regexp.
btw: 'random case' is a quirky feature of my editor that I never had any use for before.

Re: Gawk IGNORECASE=0 vs =1

<sv7uvt$rvj$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1123&group=comp.lang.awk#1123

by: Ed Morton - Thu, 24 Feb 2022 12:51 UTC

Am I right in thinking that by the above you mean your test script is
basically a script that calls that function some large number of times
in a loop with one of the stated strings, e.g.:

BEGIN {
    low = "include variable function namespace x"
    for (i=1;i<=1000000;i++) testmatch(low)
}

I'm still struggling to understand what we're supposed to **do** with
the above information. I mean if we need to match a regexp against
mixed-case input we have 2 choices:

1) IGNORECASE=1; .. $0 ~ /foo/
2) tolower($0) ~ /foo/

and what we cannot do is just:

3) $0 ~ /foo/

so what can we do with the information that "1" would be slower than "3"
since we can't use "3" for this anyway? If you told us that "1" was
slower than "2" then we could use that information to write scripts
using "2" instead of "1" but I just don't see how the speed of "1" vs
the speed of "3" is something we can act on.

Ed.

Re: Gawk IGNORECASE=0 vs =1

<cb191e2c-400d-43ef-8a80-7417cad19575n@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1125&group=comp.lang.awk#1125

by: J Naman - Thu, 24 Feb 2022 23:55 UTC

Ed, you're quite right, and I apologize for not telling you what motivated all this. I have files of mixed-case text and regexps that are mixed case, e.g. /New York, NY/. I did an @include "foo" that pulled in a function that set IGNORECASE=1, and everything s-l-o-w-e-d down. Once I figured out that IGNORECASE was probably responsible, I benchmarked to see what the times were. Thus, for my data, if and when possible, I match text to regexp exactly, with IGNORECASE=0. As I said, no surprise. Sorry if I have wasted people's time. John

Re: Gawk IGNORECASE=0 vs =1

<sv9758$g2t$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1126&group=comp.lang.awk#1126

by: Ed Morton - Fri, 25 Feb 2022 00:16 UTC

Ah, now I understand what this was about. Thanks for the information.

Ed.

Re: Gawk IGNORECASE=0 vs =1

<81d4d7df-9f56-4f6e-a661-0c7359aca958n@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1177&group=comp.lang.awk#1177

by: Kpop 2GM - Mon, 28 Mar 2022 19:59 UTC

% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 'tolower($0)~"^include variable function namespace x$"' FS='^$' ) | pvE9)| wc5

in0: 3.40GiB 0:00:04 [ 818MiB/s] [ 818MiB/s] [=============================>] 100%
out9: 64.3MiB 0:00:04 [15.1MiB/s] [15.1MiB/s] [ <=> ]
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 4.07s user 0.79s system 113% cpu 4.275 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 'toupper($0)~"^INCLUDE VARIABLE FUNCTION NAMESPACE X$"' FS='^$' ) | pvE9)| wc5
in0: 80.2MiB 0:00:00 [ 801MiB/s] [ 801MiB/s] [> ] 2% ETA 0:00:00
out9: 64.3MiB 0:00:04 [15.1MiB/s] [15.1MiB/s] [ <=> ]
in0: 3.40GiB 0:00:04 [ 820MiB/s] [ 820MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 4.06s user 0.79s system 113% cpu 4.268 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

% ( time ( pvE0 < testnamespacecase_9999999.txt | mawk2 '/^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Vv][Aa][Rr][Ii][Aa][Bb][Ll][Ee] [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] [Nn][Aa][Mm][Ee][Ss][Pp][Aa][Cc][Ee] [Xx]$/' FS='^$' )| pvE9) | wc5

out9: 64.3MiB 0:00:02 [32.0MiB/s] [32.0MiB/s] [ <=> ]
in0: 3.40GiB 0:00:02 [1.70GiB/s] [1.70GiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | mawk2 FS='^$'; ) 1.58s user 0.82s system 118% cpu 2.026 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

What I'm seeing is that if you simply make EVERY letter a bracket expression matching both its upper- and lower-case forms, and also prevent field splitting (FS='^$'), the run takes less than half the time (about 2.0s vs 4.3s total). And that's only for mawk-2. For gawk, the savings are unearthly (about 6s vs 43-45s total):

out9: 64.3MiB 0:00:43 [1.48MiB/s] [1.48MiB/s] [ <=> ]
in0: 3.40GiB 0:00:43 [80.5MiB/s] [80.5MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 43.01s user 1.12s system 101% cpu 43.317 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

in0: 3.40GiB 0:00:44 [78.0MiB/s] [78.0MiB/s] [=============================>] 100%
out9: 64.3MiB 0:00:44 [1.44MiB/s] [1.44MiB/s] [ <=> ]
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 44.35s user 1.14s system 101% cpu 44.671 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.

out9: 64.3MiB 0:00:05 [10.7MiB/s] [10.7MiB/s] [ <=> ]
in0: 3.40GiB 0:00:05 [ 582MiB/s] [ 582MiB/s] [=============================>] 100%
( pvE 0.1 in0 < testnamespacecase_9999999.txt | gawk -Se FS='^$'; ) 5.83s user 0.81s system 110% cpu 6.006 total
rows = 1773954. | UTF8 chars = 67410252. | bytes = 67410252.
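
Typing a bracket expression for every letter by hand is error-prone, so the pattern can be generated instead. A minimal sketch, using only POSIX awk features: `ci_regex` is a hypothetical helper name, and it assumes the input string contains no regexp metacharacters (anything that is not a letter is copied through unchanged).

```sh
# Sketch only: ci_regex() is a hypothetical helper, not part of any awk.
regex=$(awk '
# Turn each letter of s into a two-character bracket expression, e.g. i -> [Ii].
function ci_regex(s,    i, c, lo, up, out) {
    out = ""
    for (i = 1; i <= length(s); i++) {
        c  = substr(s, i, 1)
        lo = tolower(c)
        up = toupper(c)
        out = out (lo != up ? "[" up lo "]" : c)   # non-letters pass through as-is
    }
    return out
}
BEGIN { print "^" ci_regex("include x") "$" }')
echo "$regex"

# Use the generated pattern as a dynamic regexp.
hits=$(printf '%s\n' 'INCLUDE X' 'include y' | awk -v re="$regex" '$0 ~ re')
echo "$hits"
```

Because the generated pattern is passed as a dynamic regexp, the same program works with any awk, and no IGNORECASE setting is involved.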

====================
echo
( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se 'toupper($0)~"^INCLUDE VARIABLE FUNCTION NAMESPACE X$"' FS='^$' ) | pvE9)| wc5
sleep 1
( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se 'tolower($0)~"^include variable function namespace x$"' FS='^$' ) | pvE9)| wc5
sleep 1
( time ( pvE0 < testnamespacecase_9999999.txt | gawk -Se '/^[Ii][Nn][Cc][Ll][Uu][Dd][Ee] [Vv][Aa][Rr][Ii][Aa][Bb][Ll][Ee] [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] [Nn][Aa][Mm][Ee][Ss][Pp][Aa][Cc][Ee] [Xx]$/' FS='^$' )| pvE9) | wc5
==============================================================
The 4Chan Teller
