Re: why does patsplit() exist?

<2aeaa318-6498-4e41-a6b6-cedfe6718023n@googlegroups.com>

 by: Kpop 2GM - Mon, 6 Sep 2021 15:44 UTC

On Saturday, April 24, 2021 at 8:46:18 AM UTC-4, Ed Morton wrote:
> On 4/18/2021 8:53 AM, Ed Morton wrote:
> > On 4/17/2021 10:39 PM, J Naman wrote:
> >> On Saturday, 17 April 2021 at 20:18:07 UTC-4, Ed Morton wrote:
> >>> On 4/17/2021 10:08 AM, Ed Morton wrote:
> >>>> In gawk 4.0 two similar changes were introduced:
> >>>>
> >>>> 1) patsplit() - a new function to split a string into array elements
> >>>> that match a regexp
> >>>> 2) split() was given a 4th argument to store the strings that match the
> >>>> separator regexp in an array.
> >>>>
> >>>> For example:
> >>>>
> >>>> $ echo 'foo13bar27' | awk 'patsplit($0,vals,/[0-9]+/) { for (i in vals)
> >>>> print vals[i] }'
> >>>> 13
> >>>> 27
> >>>>
> >>>> $ echo 'foo13bar27' | awk 'split($0,tmp,/[0-9]+/,vals) { for (i in
> >>>> vals)
> >>>> print vals[i] }'
> >>>> 13
> >>>> 27
> >>>>
> >>>> Given the awk language traditionally only provides constructs that are
> >>>> hard to implement with other existing constructs and that both items
> >>>> were introduced in the same release there must be something I'm missing
> >>>> - what is it that patsplit() provides that's hard to implement with
> >>>> split()?
> >>>>
> >>>> Ed.
> >>> Thanks for the response Ben & Janis. You both gave an example of a case
> >>> I hadn't considered which is where an empty string could match the
> >>> regexp:
> >>>
> >>> Janis:
> >>> -----
> >>> $ echo 'Hello,,World' | awk 'patsplit($0,vals,/[^,]*/) { for (i in vals)
> >>> print i, vals[i] }'
> >>> 1 Hello
> >>> 2
> >>> 3 World
> >>>
> >>> $ echo 'Hello,,World' | awk 'split($0,tmp,/[^,]*/,vals) { for (i in
> >>> vals) print i, vals[i] }'
> >>> 1 Hello
> >>> 2 World
> >>> -----
> >>>
> >>> Ben:
> >>> -----
> >>> $ echo 'foo13bar27' | awk 'patsplit($0,vals,/[0-9]*/) { for (i in vals)
> >>> print i, vals[i] }'
> >>> 1
> >>> 2
> >>> 3
> >>> 4 13
> >>> 5
> >>> 6
> >>> 7 27
> >>>
> >>> $ echo 'foo13bar27' | awk 'split($0,tmp,/[0-9]*/,vals) { for (i in vals)
> >>> print i, vals[i] }'
> >>> 1 13
> >>> 2 27
> >>> -----
> >>>
> >>> Is that the only difference - whether or not an empty string can match
> >>> the regexp?
> >>>
> >>> Ed.
> >>
> >> Maybe patsplit() is a convenience, but it is very important to me. In
> >> addition to CSV files, I use patsplit() to extract all numeric
> >> percentages (e.g.12.3%) and ALL embedded dates mm/dd/yy or yyyy from
> >> highly UNSTRUCTURED text files that are aggregations of text lines
> >> from multiple sources. The text lines that I get have misspellings,
> >> non-standard abbreviations, bizarre punctuation -- "unNatural Language
> >> Processing". The extracted numeric data then clue me to how to process
> >> the sep[] text data.
> >> Example: patsplit($0, arr, /[0-9]*[.][0-9]*%/,seps); # first, extract
> >> all embedded yields (none, some, a lot)
> >>
> >
> > Wouldn't you get the same output from
> >
> > split($0, seps, /[0-9]*[.][0-9]*%/, arr)
> >
> > though?
> >
> > I'm just trying to understand what patsplit() does differently from
> > split() with the array names swapped and so far Ben and Janis gave an
> > example where it handles null strings differently - best I can tell that
> > wouldn't apply in the case you describe so is there some other difference?
> >
> > Ed.
> Well, best I can tell that handling of null strings that match the
> regexp is the only difference between the 3rd arg for patsplit() and the
> 3rd arg for split() other than the cases where split() is using either
> of the special-case FSs of "" or " ".
>
> So the key is that split() takes an FS for the 3rd arg while patsplit()
> takes a regexp and while a FS is regexp-like, it has 3 special cases
> that make it different from regexps:
>
> 1) FS = "" -> undefined by POSIX, some awks split into chars.
> 2) FS = " " -> leading/trailing spaces ignored, split on contiguous spaces.
> 3) FS = a regexp that can match a null string -> treat it like a regexp
> that cannot match a null string (e.g. `,*` gets treated like `,+`).
>
> While that 3rd point makes sense I couldn't actually find anything
> documenting the fact that a field separator isn't allowed to match a
> null string (except in the case of FS="" in some awks). POSIX says:
>
> ---------
> The following describes FS behavior:
>
> If FS is a null string, the behavior is unspecified.
>
> If FS is a single character:
>
> If FS is <space>, skip leading and trailing <blank> and
> <newline> characters; fields shall be delimited by sets of one or more
> <blank> or <newline> characters.
>
> Otherwise, if FS is any other character c, fields shall be
> delimited by each single occurrence of c.
>
> Otherwise, the string value of FS shall be considered to be an
> extended regular expression. Each occurrence of a sequence matching the
> extended regular expression shall delimit fields.
> ---------
>
> so in the case of `-F'[^,]*'`, for example, that falls into the final
> case above. It should really say "...a sequence _of 1 or more
> characters_ matching..." I suppose.
>
> That difference makes it non-trivial to implement patsplit() using
> existing functionality (i.e. split() with the args swapped). Thanks to
> all who replied.
>
> Ed.

I wrote this proof-of-concept for emulating patsplit functionality even without gawk :

mawk2 'BEGIN { sepFS="\301\372"; FS=RS="^$"; OFS=" :: "; } END { mypat="\352[\\200-\\277][\\200-\\277]|[\\353\\354][\\200-\\277][\\200-\\277]|\355[\\200-\\235][\\200-\\277]"; print gsub(mypat "|(" mypat ")( |"mypat")*("mypat")", sepFS "&" sepFS); gsub(sepFS "("sepFS")+", ""); print nx=split($0,arr, sepFS );for(x=1;x<=nx; x++) { print "\t" x ": /." arr[x] "./" ; }}'

The test pattern here is all 11,172 Korean Hangul syllables. At the same time, I didn't want it to chop Korean phrases apart at space characters, while still keeping Latin ASCII text from being split at spaces. mawk2, which isn't Unicode-aware at all, can split this out into a nearly 72000-cell array in 3.24 seconds, with all the Hangul in the even-numbered cells and all the "non-pat" text, if you will, in the odd-numbered ones.

The trick is simply to use a sep string that almost never occurs in proper UTF-8 input. I included only two bytes that are illegal in UTF-8, xC1 and xFA (\301\372). You can do a quick scan of the data, and if xC1 (\301) doesn't show up at all, just use the single byte xC1 as your sep. If it *does* show up, you are possibly working with a binary data stream, in which case keep padding the sep string with bytes you deem very unlikely to show up.

(Tip: don't bother with x00 (\000) and xFF (\377). Those two bytes are *very* common in a variety of binary file formats.)

70940: /.베리베리./
70941: /.XTOO @3./
70942: /.차 경연./
70943: /. <./
70944: /.컬래버레이션 무대./
70945: /.>=artist=14958011=VOD 657 ▸ ./
70946: /.로드 투 킹덤./
70947: /.=year=2020=05-29=secs=251=mstr=NoF=tile=t=info=1280=720=00:04:11=gnr=31219=2908-vod1=clipID=MA_306857=song=.=[Full CAM] ♬ ON - ./
70948: /.베리베리./
70949: /.XTOO @3./
70950: /.차 경연./
70951: /. <./
70952: /.컬래버레이션 무대./
70953: /.>./
70954: /.킹덤으로 가려는자./
70955: /., ./
70956: /.살아남아라./
70957: /.!Mnet <./
70958: /.로드 투 킹덤./
70959: /. (Road to Kingdom) >./
70960: /.매주 목요일 저녁../
70961: /. 8./
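
For readers who find the one-liner hard to follow, here is a minimal multi-line sketch of the same sentinel idea: wrap every match in a separator with gsub(), then split() on that separator. It is not the poster's script. The function name split_matches, the digit pattern, and the "\001" sentinel are assumptions chosen so that the example stays plain ASCII and runs in any POSIX awk.

```shell
# Hypothetical helper: emulate the matches-plus-separators view of patsplit()
# using only gsub() and split().
split_matches() {
    awk '
    BEGIN { sep = "\001" }   # assumed sentinel: a byte not expected in the input
    {
        s = $0
        gsub(/[0-9]+/, sep "&" sep, s)   # wrap each match in sentinels
        n = split(s, arr, sep)           # odd cells: non-matches, even cells: matches
        for (i = 1; i <= n; i++)
            printf "%d: /.%s./\n", i, arr[i]
    }'
}

printf 'foo13bar27baz\n' | split_matches
```

With a pattern that can match twice in a row, two sentinels end up adjacent. The original script removes such runs with a second gsub() so that neighbouring matches merge into one cell; this sketch omits that step because [0-9]+ never produces adjacent matches.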


Re: why does patsplit() exist?

<be18427a-6052-4d30-a511-06fb68278837n@googlegroups.com>

 by: Kpop 2GM - Mon, 6 Sep 2021 15:53 UTC

706571: /. 8./
706572: /.시./
706573: /.!!./
706574: /.꿀케미 커넥션./
706575: /. <./
706576: /.랜선친구 아이오아이./
706577: /.>./
706578: /.엠넷 본방사수./
706579: /.=albm=.=277 ▸ ./
706580: /.프로듀스./
706581: /. 101=G0636~ADV009~P2015~E0933647=VOD Various Artists=mnetA-487082../

real 0m2.368s
user 0m0.757s
sys 0m0.315s

Just tested with mawk 1.3.4: only 2.4 seconds to split out an array with over 700K cells, with the Korean strings in cells by themselves. With luck, the English translated names land conveniently in adjacent cells, e.g.

691357: /., ./
691358: /.원포유./
691359: /. (14U) , ./
691360: /.에이프릴./
691361: /. (APRIL) , ./
691362: /.혜이니./
691363: /. (HEYNE) , ./

[meta] Why? (was Re: why does patsplit() exist?)

<sh5ep8$1ae7$1@gioia.aioe.org>

 by: Janis Papanagnou - Mon, 6 Sep 2021 16:14 UTC

On 06.09.2021 17:44, Kpop 2GM wrote:
>
> mawk2 'BEGIN { sepFS="\301\372"; FS=RS="^$"; OFS=" :: "; } END { mypat="\352[\\200-\\277][\\200-\\277]|[\\353\\354][\\200-\\277][\\200-\\277]|\355[\\200-\\235][\\200-\\277]"; print gsub(mypat "|(" mypat ")( |"mypat")*("mypat")", sepFS "&" sepFS); gsub(sepFS "("sepFS")+", ""); print nx=split($0,arr, sepFS );for(x=1;x<=nx; x++) { print "\t" x ": /." arr[x] "./" ; }}'

Even if the standard screen width in Korea is 370+ columns, is
there any advantage writing all in one line, unformatted? Or
is there a prize for doing that? Or are you assuming that this
newsgroup is read solely by awk interpreter software (and not
by humans)? - I wonder what folks are thinking when doing so.
Or whether they are thinking at all. Or whether that is some
regional mindset. Or a mental disease. Or a religious dogma.
Or a statement of political or social protest. - Why the heck!

Re: why does patsplit() exist?

<sh5icq$4b1$1@dont-email.me>

 by: Ed Morton - Mon, 6 Sep 2021 17:16 UTC

On 9/6/2021 10:44 AM, Kpop 2GM wrote:
<snip>
> I wrote this proof-of-concept for emulating patsplit functionality even without gawk :
>
> mawk2 'BEGIN { sepFS="\301\372"; FS=RS="^$"

That's still relying on an extension to POSIX awk for multi-char RS. A
POSIX awk would treat that like `RS="^"`. I'm not going to try to read
the rest of the script since it was all crammed onto 1 line. Janis's
response covers that situation well!

Ed.
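
One way to avoid the multi-char RS dependency is to leave RS at its default and collect the records into a single string, which is then processed in END. The sketch below is only an illustration of that approach, not code from the thread. The function name slurp_count and the digit pattern are assumptions.

```shell
# Hypothetical helper: read the whole input into one string without
# relying on a multi-character RS.
slurp_count() {
    awk '
    { buf = buf $0 "\n" }                 # append each record and restore its newline
    END {
        n = gsub(/[0-9]+/, "&", buf)      # count matches across the whole input
        print n
    }'
}

printf 'a1\nb22\n\nc333\n' | slurp_count
```

Note that this always ends buf with a newline, even when the last input line had none.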
