Subject                           Author
* Unique Characters Only          Mike Sanders
+* Re: Unique Characters Only     Janis Papanagnou
|+- Re: Unique Characters Only    Mike Sanders
|`* Re: Unique Characters Only    J Naman
| `- Re: Unique Characters Only   Janis Papanagnou
+* Re: Unique Characters Only     yeti
|+* Re: Unique Characters Only    Vroomfondel
||`- Re: Unique Characters Only   Mike Sanders
|`- Re: Unique Characters Only    Mike Sanders
+- Re: Unique Characters Only     Mike Sanders
`* Re: Unique Characters Only     Ed Morton
 `- Re: Unique Characters Only    Mike Sanders

Unique Characters Only

<ufbelp$1gcjp$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1587&group=comp.lang.awk#1587

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: porkchop@invalid.foo (Mike Sanders)
Newsgroups: comp.lang.awk
Subject: Unique Characters Only
Date: Sun, 1 Oct 2023 09:38:01 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 43
Sender: Mike Sanders <busybox@sdf.org>
Message-ID: <ufbelp$1gcjp$1@dont-email.me>
Injection-Date: Sun, 1 Oct 2023 09:38:01 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3f06f443a7cf2a5f95dfb6f18fd3818e";
logging-data="1585785"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/3AMIo3yV6wXbCNsvDhXnZ"
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (NetBSD/9.3 (amd64))
Cancel-Lock: sha1:eM6hm1ZxZCNghdjbS3vBXCSuAVc=
 by: Mike Sanders - Sun, 1 Oct 2023 09:38 UTC

run as...

awk -f uniqueChars.awk

output...

Input string: Mary had a little lamb who's fleece was white as snow...
Unique chars: Mary hdlitembwo'sfcn.

script...

BEGIN {

    a = "Mary had a little lamb who's fleece was white as snow..."
    b = uniqueChars(a)

    print "Input string: " a
    print "Unique chars: " b

}

function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    delete tmp # clear array for each new string

    while(++x <= y) {
        c = substr(str, x, 1)
        if (!(c in tmp)) {
            uniqueStr = uniqueStr c
            tmp[c]
        }
    }

    return uniqueStr

}

--
:wq
Mike Sanders

Re: Unique Characters Only

<ufbg9k$1gnoj$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1588&group=comp.lang.awk#1588

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: janis_papanagnou+ng@hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Sun, 1 Oct 2023 12:05:39 +0200
Organization: A noiseless patient Spider
Lines: 65
Message-ID: <ufbg9k$1gnoj$1@dont-email.me>
References: <ufbelp$1gcjp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 1 Oct 2023 10:05:40 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="cb9e02d9ce5241eb5d07d33bbe42e002";
logging-data="1597203"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19eYe/jIxeD4J8g0Dz4a6jC"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:kAJUYpLvdneM4CWH1RRbrnW6doQ=
X-Enigmail-Draft-Status: N1110
In-Reply-To: <ufbelp$1gcjp$1@dont-email.me>
 by: Janis Papanagnou - Sun, 1 Oct 2023 10:05 UTC

If you want to avoid the substr() function calls in the loop and don't
mind using non-standard (GNU Awk) features you can also use split():

function uniqueChars (t, s, n, i, c, o, seen)
{
    delete seen
    n = split (t, s, "")
    for (i=1; i<=n; i++)
        if (!seen[c = s[i]]++)
            o = o c

    return o
}

{
    printf "In:\t%s\nOut:\t%s\n", $0, uniqueChars($0)
}

(Just for a variant.)

Janis

On 01.10.2023 11:38, Mike Sanders wrote:
> run as...
>
> awk -f uniqueChars.awk
>
> output...
>
> Input string: Mary had a little lamb who's fleece was white as snow...
> Unique chars: Mary hdlitembwo'sfcn.
>
> script...
>
> BEGIN {
>
> a = "Mary had a little lamb who's fleece was white as snow..."
> b = uniqueChars(a)
>
> print "Input string: " a
> print "Unique chars: " b
>
> }
>
> function uniqueChars(str, x, y, c, tmp, uniqueStr) {
>
> y = length(str)
> uniqueStr = ""
> delete tmp # clear array for each new string
>
> while(++x <= y) {
> c = substr(str, x, 1)
> if (!(c in tmp)) {
> uniqueStr = uniqueStr c
> tmp[c]
> }
> }
>
> return uniqueStr
>
> }
>

Re: Unique Characters Only

<ufbhgb$1guhb$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1589&group=comp.lang.awk#1589

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: porkchop@invalid.foo (Mike Sanders)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Sun, 1 Oct 2023 10:26:19 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 31
Sender: Mike Sanders <busybox@sdf.org>
Message-ID: <ufbhgb$1guhb$1@dont-email.me>
References: <ufbelp$1gcjp$1@dont-email.me> <ufbg9k$1gnoj$1@dont-email.me>
Injection-Date: Sun, 1 Oct 2023 10:26:19 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3f06f443a7cf2a5f95dfb6f18fd3818e";
logging-data="1604139"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+S8wFSpaSZIau3SovGa4y3"
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (NetBSD/9.3 (amd64))
Cancel-Lock: sha1:qMdXHm8+AfVvmdClk22RwjjDNXk=
 by: Mike Sanders - Sun, 1 Oct 2023 10:26 UTC

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

> If you want to avoid the substr() function calls in the loop and don't
> mind using non-standard (GNU Awk) features you can also use split():
>
> function uniqueChars (t, s, n, i, c, o, seen)
> {
> delete seen
> n = split (t, s, "")
> for (i=1; i<=n; i++)
> if (!seen[c = s[i]]++)
> o = o c
>
> return o
> }
>
> {
> printf "In:\t%s\nOut:\t%s\n", $0, uniqueChars($0)
> }
>
>
> (Just for a variant.)
>
> Janis

Janis you are good! =)

--
:wq
Mike Sanders

Re: Unique Characters Only

<87sf6u1vt6.fsf@tilde.institute>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1590&group=comp.lang.awk#1590

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: yeti@tilde.institute (yeti)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Sun, 01 Oct 2023 10:28:05 +0000
Organization: Democratic Order of Pirates International (DOPI)
Lines: 9
Message-ID: <87sf6u1vt6.fsf@tilde.institute>
References: <ufbelp$1gcjp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: dont-email.me; posting-host="d5d0a25f0e94c188036849c1957cc145";
logging-data="1605493"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/MjqMjXExup1rHvc6psfuL"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
Cancel-Lock: sha1:X/HYFmc/Xtf4VT3zPrIG2iNj+Zg=
sha1:7MHfCCVTEZo8+Gym5iwVaQNVb6I=
 by: yeti - Sun, 1 Oct 2023 10:28 UTC

You can use `index` to avoid adding the same char multiple times to the
result string. That may or may not be faster than using an additional
array, I've not benchmarked it (yet?). You may call me Vroomfondel
today.

--
1. Hitchhiker 25: (59) Scarcely pausing for breath, Vroomfondel shouted,
"We don't demand solid facts! What we demand is a total absence of solid
facts. I demand that I may or may not be Vroomfondel!"

Re: Unique Characters Only

<87o7hi1vgm.fsf@tilde.institute>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1592&group=comp.lang.awk#1592

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: yeti@tilde.institute (Vroomfondel)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Sun, 01 Oct 2023 10:35:37 +0000
Organization: My little echo chamber.
Lines: 27
Message-ID: <87o7hi1vgm.fsf@tilde.institute>
References: <ufbelp$1gcjp$1@dont-email.me> <87sf6u1vt6.fsf@tilde.institute>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="d5d0a25f0e94c188036849c1957cc145";
logging-data="1605493"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Frw3S1GioyMSr79DPIAWh"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
Cancel-Lock: sha1:11ZOov1RpQ+gVN/9WXklyUzV7ik=
sha1:G8Izhm5c0b+C7K6FdE8LhstxuMI=
 by: Vroomfondel - Sun, 1 Oct 2023 10:35 UTC

Reading beyond this line equals signing an implicit "Non Laughing
Agreement". ;-)

Only minimally tested:

function uniqueChars(str,_c,_i,_seen) {
    _seen=""
    for(_i=1;_i<=length(str);_i++) {
        if(0<index(str,_c=substr(str,_i,1)))
            if(index(_seen,_c)<1)
                _seen=_seen _c
    }
    return _seen
}

Maybe the assumption that array ops are more expensive than string ops
is wrong and this idea is worse than the original.

Next stop: Recursion!
(Just joking!)

--
1. Hitchhiker 25: (59) Scarcely pausing for breath, Vroomfondel shouted,
"We don't demand solid facts! What we demand is a total absence of solid
facts. I demand that I may or may not be Vroomfondel!"

Re: Unique Characters Only

<ufbk7o$1hfrl$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1593&group=comp.lang.awk#1593

Path: i2pn2.org!rocksolid2!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: porkchop@invalid.foo (Mike Sanders)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Sun, 1 Oct 2023 11:12:57 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 13
Sender: Mike Sanders <busybox@sdf.org>
Message-ID: <ufbk7o$1hfrl$1@dont-email.me>
References: <ufbelp$1gcjp$1@dont-email.me> <87sf6u1vt6.fsf@tilde.institute>
Injection-Date: Sun, 1 Oct 2023 11:12:57 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3f06f443a7cf2a5f95dfb6f18fd3818e";
logging-data="1621877"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18AZONv6Z+XBJAhzZcxEx2T"
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (NetBSD/9.3 (amd64))
Cancel-Lock: sha1:+AWxxdRj2B4AppSbcmMgGrLMqvQ=
 by: Mike Sanders - Sun, 1 Oct 2023 11:12 UTC

yeti <yeti@tilde.institute> wrote:

> You can use `index` to avoid adding the same char multiple times to the
> result string. That may or may not be faster than using an additional
> array, I've not benchmarked it (yet?). You may call me Vroomfondel
> today.

Hey, sounds interesting, please post your complete code.

--
:wq
Mike Sanders

Re: Unique Characters Only

<ufbkcd$1hfrl$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1594&group=comp.lang.awk#1594

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: porkchop@invalid.foo (Mike Sanders)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Sun, 1 Oct 2023 11:15:25 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 32
Sender: Mike Sanders <busybox@sdf.org>
Message-ID: <ufbkcd$1hfrl$2@dont-email.me>
References: <ufbelp$1gcjp$1@dont-email.me> <87sf6u1vt6.fsf@tilde.institute> <87o7hi1vgm.fsf@tilde.institute>
Injection-Date: Sun, 1 Oct 2023 11:15:25 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3f06f443a7cf2a5f95dfb6f18fd3818e";
logging-data="1621877"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19E8jPjdkitzUftGGvSLiW4"
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (NetBSD/9.3 (amd64))
Cancel-Lock: sha1:yPz/E55nqUVg0fU9irKusoE+Y8k=
 by: Mike Sanders - Sun, 1 Oct 2023 11:15 UTC

Vroomfondel <yeti@tilde.institute> wrote:

> Reading beyond this line equals signing an implicit "Non Laughing
> Agreement". ;-)
>
>
> Only minimally tested:
>
> function uniqueChars(str,_c,_i,_seen) {
> _seen=""
> for(_i=1;_i<=length(str);_i++) {
> if(0<index(str,_c=substr(str,_i,1)))
> if(index(_seen,_c)<1)
> _seen=_seen _c
> }
> return _seen
> }
>
> Maybe the assumption that array ops are more expensive than string ops
> is wrong and this idea is worse than the original.

But then again, maybe not. I must study this more (thank you!).

> Next stop: Recursion!
> (Just joking!)

Chuckle, GNU acronyms come to mind...

--
:wq
Mike Sanders

Re: Unique Characters Only

<ufdjj8$2pq24$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1602&group=comp.lang.awk#1602

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: porkchop@invalid.foo (Mike Sanders)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Mon, 2 Oct 2023 05:14:16 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 41
Sender: Mike Sanders <busybox@sdf.org>
Message-ID: <ufdjj8$2pq24$1@dont-email.me>
References: <ufbelp$1gcjp$1@dont-email.me>
Injection-Date: Mon, 2 Oct 2023 05:14:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="b80093a998c08ec5e7b71710a7b0258d";
logging-data="2943044"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Rx4llB2O5cWhhXDtldTLy"
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (NetBSD/9.3 (amd64))
Cancel-Lock: sha1:ir0c5DSUogBEdQWZWi3M2oAYudA=
 by: Mike Sanders - Mon, 2 Oct 2023 05:14 UTC

Mike Sanders <porkchop@invalid.foo> wrote:

> function uniqueChars(str, x, y, c, tmp, uniqueStr) {
>
> y = length(str)
> uniqueStr = ""
> delete tmp # clear array for each new string
>
> while(++x <= y) {
> c = substr(str, x, 1)
> if (!(c in tmp)) {
> uniqueStr = uniqueStr c
> tmp[c]
> }
> }
>
> return uniqueStr
>
> }
>

okay, got rid of tmp array...

function uniqueChars(str, x, y, c, uniqueStr) {

    y = length(str)
    uniqueStr = ""

    while(++x <= y) {
        c = substr(str, x, 1)
        if (index(uniqueStr, c) == 0) uniqueStr = uniqueStr c
    }

    return uniqueStr

}

--
:wq
Mike Sanders

Re: Unique Characters Only

<412a1c3d-05df-469f-9630-97107a70e91dn@googlegroups.com>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1611&group=comp.lang.awk#1611

X-Received: by 2002:a05:620a:c18:b0:773:f15d:3c07 with SMTP id l24-20020a05620a0c1800b00773f15d3c07mr2777qki.3.1696267310953;
Mon, 02 Oct 2023 10:21:50 -0700 (PDT)
X-Received: by 2002:a05:6830:6b44:b0:6bd:c74e:f21d with SMTP id
dc4-20020a0568306b4400b006bdc74ef21dmr3371253otb.4.1696267310618; Mon, 02 Oct
2023 10:21:50 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.awk
Date: Mon, 2 Oct 2023 10:21:50 -0700 (PDT)
In-Reply-To: <ufbg9k$1gnoj$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=151.200.57.7; posting-account=BcR7vAoAAABY9YgIIYIhD68t7wwjMvJW
NNTP-Posting-Host: 151.200.57.7
References: <ufbelp$1gcjp$1@dont-email.me> <ufbg9k$1gnoj$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <412a1c3d-05df-469f-9630-97107a70e91dn@googlegroups.com>
Subject: Re: Unique Characters Only
From: jnaman2@gmail.com (J Naman)
Injection-Date: Mon, 02 Oct 2023 17:21:50 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: J Naman - Mon, 2 Oct 2023 17:21 UTC

On Sunday, 1 October 2023 at 06:05:43 UTC-4, Janis Papanagnou wrote:
> If you want to avoid the substr() function calls in the loop and don't
> mind using non-standard (GNU Awk) features you can also use split():
>
> function uniqueChars (t, s, n, i, c, o, seen)
> {
> delete seen
> n = split (t, s, "")
> for (i=1; i<=n; i++)
> if (!seen[c = s[i]]++)
> o = o c
>
> return o
> }
>
> {
> printf "In:\t%s\nOut:\t%s\n", $0, uniqueChars($0)
> }
>
>
> (Just for a variant.)
>
> Janis
> On 01.10.2023 11:38, Mike Sanders wrote:
> > run as...
> >
> > awk -f uniqueChars.awk
> >
> > output...
> >
> > Input string: Mary had a little lamb who's fleece was white as snow...
> > Unique chars: Mary hdlitembwo'sfcn.
> >
> > script...
> >
> > BEGIN {
> >
> > a = "Mary had a little lamb who's fleece was white as snow..."
> > b = uniqueChars(a)
> >
> > print "Input string: " a
> > print "Unique chars: " b
> >
> > }
> >
> > function uniqueChars(str, x, y, c, tmp, uniqueStr) {
> >
> > y = length(str)
> > uniqueStr = ""
> > delete tmp # clear array for each new string
> >
> > while(++x <= y) {
> > c = substr(str, x, 1)
> > if (!(c in tmp)) {
> > uniqueStr = uniqueStr c
> > tmp[c]
> > }
> > }
> >
> > return uniqueStr
> >
> > }
> >
The Unique Chars Only discussion of string Op variants -- substr()
split() index() -- reminded me of an interesting note in the most
excellent Awk book, 2nd Edition. They benchmark split() vs substr()
for single character operations and substr() is 40% faster than
split(). I always assumed that split() was faster.
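
One rough way to check a claim like this on your own awk is to time both approaches over the same input. The sketch below is only an illustration under stated assumptions: the names `viaSubstr`, `viaSplit`, `mode` and `reps` are made up here and come neither from the book nor from the thread, and `split(s, a, "")` breaking a string into single characters is an extension (gawk, mawk and the one true awk do it) rather than POSIX behaviour. Results will vary with the awk implementation, the locale and the data size.

```awk
# bench.awk: hypothetical timing harness (all names are illustrative)

# Character-by-character scan using substr().
function viaSubstr(str,    i, n, c, seen, out) {
    n = length(str)
    for (i = 1; i <= n; i++) {
        c = substr(str, i, 1)
        if (!seen[c]++) out = out c
    }
    return out
}

# Same result, but the characters are prepared once with split().
function viaSplit(str,    i, n, chars, seen, out) {
    n = split(str, chars, "")   # empty separator: an extension, not POSIX
    for (i = 1; i <= n; i++)
        if (!seen[chars[i]]++) out = out chars[i]
    return out
}

BEGIN {
    s = "Mary had a little lamb who's fleece was white as snow..."
    if (reps == "") reps = 1    # default when -v reps=... is not given
    for (r = 1; r <= reps; r++)
        res = (mode == "split") ? viaSplit(s) : viaSubstr(s)
    print res
}
```

It could be run as `time awk -v mode=substr -v reps=100000 -f bench.awk` and again with `mode=split`, comparing the wall-clock times the shell reports. Both modes should print the same string of unique characters.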

Re: Unique Characters Only

<uff35p$32uf0$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1612&group=comp.lang.awk#1612

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: janis_papanagnou+ng@hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Mon, 2 Oct 2023 20:46:16 +0200
Organization: A noiseless patient Spider
Lines: 47
Message-ID: <uff35p$32uf0$1@dont-email.me>
References: <ufbelp$1gcjp$1@dont-email.me> <ufbg9k$1gnoj$1@dont-email.me>
<412a1c3d-05df-469f-9630-97107a70e91dn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 2 Oct 2023 18:46:17 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="06408dbbbd3611f922512167001e2ecc";
logging-data="3242464"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/RPantlpaMrYc8no2Xv68Q"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:XqSVQPHgFRgmx6JjbPelp6H2udE=
X-Enigmail-Draft-Status: N1110
In-Reply-To: <412a1c3d-05df-469f-9630-97107a70e91dn@googlegroups.com>
 by: Janis Papanagnou - Mon, 2 Oct 2023 18:46 UTC

On 02.10.2023 19:21, J Naman wrote:
>
> The Unique Chars Only discussion of string Op variants -- substr()
> split() index() -- reminded me of an interesting note in the most
> excellent Awk book, 2nd Edition. They benchmark split() vs substr()
> for single character operations and substr() is 40% faster than
> split(). I always assumed that split() was faster.

In many cases it's not obvious until it is measured in practice!
Specifically you (usually) have to consider also the data sizes.

(Also note that the book may not cover GNU Awk's special case of
splitting into characters, which may perform better or worse.)

And you may also consider other factors (if the speed difference
is not that significant, as with the small data set we have here),
e.g. which construct is more maintainable (in its various aspects).
I'd expect that substr(s,p,1) is extremely fast because that should
basically be only an indexed access on a linear array[*], and such
operations are often performed in one CPU cycle. Considering other
implementations for this substring from other languages, say, s[p]
will not give the impression of a performance problem - there's no
explicit function call but there shouldn't be much difference given
the primitivity of this operation[**].

The reason for my split()-based variant was to have no function call
overheads and algorithmically removing an "invariant" from the loop
(preparing the data). Note that the access of the split-array might
also be a [fast] indexed access, or it may be a [fast] hash-array
access (as I think it is in Awk).

But that all said you also have to consider that arrays are memory
consumptive! If you have huge data sets it will be considerably
worse to use split with arrays than simply index-access a string.
(It doesn't matter here, but be aware.)

In short, I don't think that efficiency is an issue here. All that
considered I'd *not* use the split method - which is, as mentioned,
not even standard with FS="" - to be on the safe side.

Janis

[*] A caveat e.g. in case of variable-width characters in arrays
used for string representation; it might be more expensive than it
appears at first glance.

[**] Performance issues usually arise on higher algorithmic levels.

Re: Unique Characters Only

<ui8bah$1ko5$1@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1698&group=comp.lang.awk#1698

Path: i2pn2.org!i2pn.org!news.hispagatos.org!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: mortonspam@gmail.com (Ed Morton)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Sun, 5 Nov 2023 09:11:13 -0600
Organization: A noiseless patient Spider
Lines: 103
Message-ID: <ui8bah$1ko5$1@dont-email.me>
References: <ufbelp$1gcjp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 5 Nov 2023 15:11:13 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="499fbea221ac1834cc2b299cfde571ca";
logging-data="54021"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+uBfBUO6XpSSj2J8n9yWUL"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:R4/DDoZn1B8XPnUVNTPhvqvBw24=
Content-Language: en-US
X-Antivirus-Status: Clean
In-Reply-To: <ufbelp$1gcjp$1@dont-email.me>
X-Antivirus: Avast (VPS 231105-0, 11/4/2023), Outbound message
 by: Ed Morton - Sun, 5 Nov 2023 15:11 UTC

On 10/1/2023 4:38 AM, Mike Sanders wrote:
> run as...
>
> awk -f uniqueChars.awk
>
> output...
>
> Input string: Mary had a little lamb who's fleece was white as snow...
> Unique chars: Mary hdlitembwo'sfcn.
>
> script...
>
> BEGIN {
>
> a = "Mary had a little lamb who's fleece was white as snow..."
> b = uniqueChars(a)
>
> print "Input string: " a
> print "Unique chars: " b
>
> }
>
> function uniqueChars(str, x, y, c, tmp, uniqueStr) {
>
> y = length(str)
> uniqueStr = ""
> delete tmp # clear array for each new string
You don't need to do that `delete` - just having "tmp" listed in the
args list will re-init it every time the function is called. Removing
that statement will also make your script portable to awks that don't
support `delete array` (but most, possibly all, modern awks do support
that even though it's technically still undefined behavior).
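
As a quick check of that point, the sketch below (the function and variable names are made up for illustration) calls a function twice and counts the elements of its local array each time. It should print 1 both times, because awk starts the extra parameters of a function afresh on every call.

```awk
# Hypothetical demo: "mark", "seen", "k" and "n" are illustrative names.
function mark(key,    seen, k, n) {
    seen[key]                 # referencing an element creates it in the local array
    for (k in seen) n++       # count the elements currently in seen
    return n
}

BEGIN {
    print mark("a")           # 1
    print mark("b")           # 1 again: seen was empty at the start of this call
}
```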

>
> while(++x <= y) {
Using a `while` instead of `for` loop for that makes your code a bit
less clear, a bit more fragile (what if `x` gets set above?), and a bit
harder to maintain (what if in future you need to increment x by 2 every
iteration?). It's not worth saving the few characters over the
traditional `for ( x=1; x<=y; x++ )`

> c = substr(str, x, 1)
> if (!(c in tmp)) {
Idiomatically that'd be implemented as

if ( !tmp[c]++ ) {

and then you'd remove the `tmp[c]` below but the array in that case is
almost always named `seen[]` rather than `tmp[]`.
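
Put together, the function with these suggestions applied might look like the sketch below. It is the original poster's logic with a `for` loop, no `delete`, and the `seen` naming convention; the shorter local names `i`, `n` and `out` are a choice made here, not something from the thread.

```awk
function uniqueChars(str,    i, n, c, seen, out) {
    n = length(str)
    for (i = 1; i <= n; i++) {
        c = substr(str, i, 1)
        # seen[c]++ evaluates to 0 the first time c is met, so the
        # test is true exactly once per distinct character
        if (!seen[c]++)
            out = out c
    }
    return out
}

BEGIN {
    # prints: Mary hdlitembwo'sfcn.
    print uniqueChars("Mary had a little lamb who's fleece was white as snow...")
}
```

One difference from the `(c in tmp)` test: `seen[c]++` also stores how many times each character occurred, which costs nothing extra here and is sometimes useful.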

> uniqueStr = uniqueStr c
> tmp[c]
> }
> }
>
> return uniqueStr
>
> }
>
Alternatively, if the order of the characters returned doesn't matter,
you could do:

function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    for ( x=1; x<=y; x++ ) {
        tmp[substr(str,x,1)]
    }
    for ( c in tmp ) {
        uniqueStr = uniqueStr c
    }

    return uniqueStr

}

I don't expect that to be any faster or anything, it's just different,
but if you have GNU awk then it can be tweaked to:

function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    for ( x=1; x<=y; x++ ) {
        tmp[substr(str,x,1)]
    }
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for ( c in tmp ) {
        uniqueStr = uniqueStr c
    }

    return uniqueStr

}

and then it'll return the unique characters sorted in alphabetic order
which may be useful.
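
One thing to keep in mind with the GNU awk version: `PROCINFO["sorted_in"]` is a global setting, so once assigned it also affects every later `for (x in array)` loop in the program. A minimal sketch of one way to contain that, assuming gawk, is to save and restore the previous value around the loop. The name `uniqueCharsSorted` and the helper variables `had` and `save` are made up for illustration.

```awk
# gawk only: other awks treat PROCINFO as an ordinary array and
# traverse "seen" in an unspecified order.
function uniqueCharsSorted(str,    i, n, c, seen, out, had, save) {
    n = length(str)
    for (i = 1; i <= n; i++)
        seen[substr(str, i, 1)]

    # remember any traversal order the caller had already chosen
    had = ("sorted_in" in PROCINFO)
    if (had) save = PROCINFO["sorted_in"]
    PROCINFO["sorted_in"] = "@ind_str_asc"

    for (c in seen)
        out = out c

    # put things back the way they were
    if (had) PROCINFO["sorted_in"] = save
    else     delete PROCINFO["sorted_in"]

    return out
}

BEGIN {
    print uniqueCharsSorted("Mary had a little lamb who's fleece was white as snow...")
}
```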

Regards,

Ed.

Re: Unique Characters Only

<ui9ltr$barj$2@dont-email.me>

https://www.rocksolidbbs.com/devel/article-flat.php?id=1711&group=comp.lang.awk#1711

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: porkchop@invalid.foo (Mike Sanders)
Newsgroups: comp.lang.awk
Subject: Re: Unique Characters Only
Date: Mon, 6 Nov 2023 03:18:19 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 86
Sender: Mike Sanders <busybox@sdf.org>
Message-ID: <ui9ltr$barj$2@dont-email.me>
References: <ufbelp$1gcjp$1@dont-email.me> <ui8bah$1ko5$1@dont-email.me>
Injection-Date: Mon, 6 Nov 2023 03:18:19 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="bea9002c59825590baf04d6c11b58fc5";
logging-data="371571"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18wWg/TVxEeVcqmoqf9gLDM"
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (NetBSD/9.3 (amd64))
Cancel-Lock: sha1:xCeacasD44AX3XsUl6B0c4pCb+g=
 by: Mike Sanders - Mon, 6 Nov 2023 03:18 UTC

Ed Morton <mortonspam@gmail.com> wrote:

> You don't need to do that `delete` - just having "tmp" listed in the
> args list will re-init it every time the function is called. Removing
> that statement will also make your script portable to awks that don't
> support `delete array` (but most, possibly all, modern awks do support
> that even though it's technically still undefined behavior).

You know I wondered about that, thought I'd play it safe, but yeah,
noted: array always created anew, good to know.

>> while(++x <= y) {
> Using a `while` instead of `for` loop for that makes your code a bit
> less clear, a bit more fragile (what if `x` gets set above?), and a bit
> harder to maintain (what if in future you need to increment x by 2 every
> iteration?).

Aye.

> It's not worth saving the few characters over the
> traditional `for ( x=1; x<=y; x++ )`
>
>> c = substr(str, x, 1)
>> if (!(c in tmp)) {
> Idiomatically that'd be implemented as
>
> if ( !tmp[c]++ ) {
>
> and then you'd remove the `tmp[c]` below but the array in that case is
> almost always named `seen[]` rather than `tmp[]`.
>
>> uniqueStr = uniqueStr c
>> tmp[c]
>> }
>> }
>>
>> return uniqueStr
>>
>> }
>>
> Alternatively, if the order of the characters returned doesn't matter,
> you could do:
>
> function uniqueChars(str, x, y, c, tmp, uniqueStr) {
>
> y = length(str)
> uniqueStr = ""
> for ( x=1; x<=y; x++ ) {
> tmp[substr(str,x,1)]
> }
> for ( c in tmp ) {
> uniqueStr = uniqueStr c
> }
>
> return uniqueStr
>
> }
>
> I don't expect that to be any faster or anything, it's just different,
> but if you have GNU awk then it can be tweaked to:
>
> function uniqueChars(str, x, y, c, tmp, uniqueStr) {
>
> y = length(str)
> uniqueStr = ""
> for ( x=1; x<=y; x++ ) {
> tmp[substr(str,x,1)]
> }
> PROCINFO["sorted_in"] = "@ind_str_asc"
> for ( c in tmp ) {
> uniqueStr = uniqueStr c
> }
>
> return uniqueStr
>
> }
>
> and then it'll return the unique characters sorted in alphabetic order
> which may be useful.

Must add these examples to my notes.

--
:wq
Mike Sanders
