
Subject                       Author
* working with cnfs           Nigel Reed
+- Re: working with cnfs      Richard
+* Re: working with cnfs      Julien ÉLIE
|`* Re: working with cnfs     Nigel Reed
| +- Re: working with cnfs    Richard
| +- Re: working with cnfs    Jesse Rehmer
| `* Re: working with cnfs    Russ Allbery
|  `- Re: working with cnfs   Jesse Rehmer
`* Re: working with cnfs      Adam W.
 `* Re: working with cnfs     Nigel Reed
  `* Re: working with cnfs    Adam W.
   `- Re: working with cnfs   Nigel Reed

working with cnfs

<20230417103454.440dbdc0@wibble.sysadmininc.com>

Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!.POSTED.47.189.156.66!not-for-mail
From: sysop@endofthelinebbs.com (Nigel Reed)
Newsgroups: news.software.nntp
Subject: working with cnfs
Date: Mon, 17 Apr 2023 10:34:54 -0500
Organization: End Of The Line BBS
Message-ID: <20230417103454.440dbdc0@wibble.sysadmininc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: www.sysadmininc.com; posting-host="47.189.156.66";
logging-data="180463"; mail-complaints-to="usenet@www.sysadmininc.com"
X-Newsreader: Claws Mail 4.1.1git47 (GTK 3.24.33; x86_64-pc-linux-gnu)
 by: Nigel Reed - Mon, 17 Apr 2023 15:34 UTC

Hi all,

Earlier this week, my news server ran out of inodes and stopped
accepting articles. I want to say a big thank you to Julien, who helped
me get back up and running by converting to CNFS. I never thought
I'd get enough articles to have a problem, but obviously that wasn't the
case.

Now that I've switched, I have a couple of issues, and both are about
using grep to find articles. I'm hoping someone has done this sort of
thing before so I don't have to reinvent the wheel.

Previously, when I was looking for something in particular, I could just
go into the spool/articles directory and start grepping. This isn't
possible with CNFS. So the questions are:

1. How can I search the entire spool for a given phrase?

2. How can I search a given hierarchy recursively for a given phrase?

I'm sure I'm not the first person who wants to do this, so hopefully
someone has a solution.

Thanks,
Nigel

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23

Re: working with cnfs

<u1jueo$2gjli$1@news.xmission.com>

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: legalize+jeeves@mail.xmission.com (Richard)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 17:08:40 -0000 (UTC)
Organization: multi-cellular, biological
Sender: legalize+jeeves@mail.xmission.com
Message-ID: <u1jueo$2gjli$1@news.xmission.com>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com>
Reply-To: (Richard) legalize+jeeves@mail.xmission.com
Injection-Date: Mon, 17 Apr 2023 17:08:40 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:2607:fa18:0:beef::4";
logging-data="2641586"; mail-complaints-to="abuse@xmission.com"
X-Reply-Etiquette: No copy by email, please
Mail-Copies-To: never
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: legalize@shell.xmission.com (Richard)
 by: Richard - Mon, 17 Apr 2023 17:08 UTC

[Please do not mail me a copy of your followup]

Nigel Reed <sysop@endofthelinebbs.com> spake the secret code
<20230417103454.440dbdc0@wibble.sysadmininc.com> thusly:

>me get back up and running by converting to CNFS.

Hey! I learned something today :). I was previously unfamiliar with
CNFS, but it makes perfect sense. I've been hacking on trn and
thinking of a similar in-memory data structure for all the text that
comes back to the news reader.

>Previously, when I was looking for something in particular, I could just
>go into the spool/articles directory and start grepping. This isn't
>possible with CNFS. So the questions are:
>
>1. How can I search the entire spool for a given phrase?
>
>2. How can I search a given hierarchy recursively for a given phrase?
>
>I'm sure I'm not the first person who wants to do this, so hopefully
>someone has a solution.

This could be done with some shell scripts and netcat talking to your
news server's nntp port. I imagine there's probably something better
by now though.
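
A minimal sketch of the NNTP approach, written in Python rather than
shell and netcat. It assumes the server accepts reader commands without
MODE READER or authentication, walks one group from its low to its high
article number, and reports the numbers of the articles that contain the
phrase. All function names are illustrative, not part of any existing
tool.

```python
import socket


def read_multiline(rfile):
    """Read a dot-terminated NNTP block and undo dot-stuffing."""
    lines = []
    while True:
        line = rfile.readline()
        if not line or line in (b".\r\n", b".\n"):
            return lines
        if line.startswith(b".."):
            line = line[1:]
        lines.append(line.rstrip(b"\r\n"))


def command(rfile, wfile, text):
    """Send one command and return the status line of the response."""
    wfile.write(text.encode("ascii") + b"\r\n")
    wfile.flush()
    return rfile.readline().decode("ascii", "replace").rstrip("\r\n")


def search_group(rfile, wfile, group, phrase):
    """Yield article numbers in group whose text contains phrase (bytes)."""
    rfile.readline()  # server greeting
    status = command(rfile, wfile, "GROUP " + group)
    if not status.startswith("211"):
        return
    _code, _count, low, high = status.split()[:4]
    for number in range(int(low), int(high) + 1):
        status = command(rfile, wfile, "ARTICLE %d" % number)
        if not status.startswith("220"):
            continue  # for example 423: no article with that number
        # The whole block is read before testing, so the stream stays in sync.
        if any(phrase in line for line in read_multiline(rfile)):
            yield number


def search_server(host, group, phrase, port=119):
    """Connect to host and search one group; phrase is a str."""
    with socket.create_connection((host, port)) as sock:
        rfile = sock.makefile("rb")
        wfile = sock.makefile("wb")
        return list(search_group(rfile, wfile, group, phrase.encode()))
```

Covering a hierarchy would mean repeating this for every group the
server lists under it, and every article still crosses the socket once
per search, so this is no faster than the other unindexed approaches in
this thread.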
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Terminals Wiki <http://terminals-wiki.org>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Re: working with cnfs

<u1k2cf$1ran5$1@news.trigofacile.com>

Path: i2pn2.org!rocksolid2!news.neodome.net!weretis.net!feeder8.news.weretis.net!news.trigofacile.com!.POSTED.176-143-2-105.abo.bbox.fr!not-for-mail
From: iulius@nom-de-mon-site.com.invalid (Julien ÉLIE)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 20:15:43 +0200
Organization: Groupes francophones par TrigoFACILE
Message-ID: <u1k2cf$1ran5$1@news.trigofacile.com>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 17 Apr 2023 18:15:43 -0000 (UTC)
Injection-Info: news.trigofacile.com; posting-account="julien"; posting-host="176-143-2-105.abo.bbox.fr:176.143.2.105";
logging-data="1944293"; mail-complaints-to="abuse@trigofacile.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.10.0
Cancel-Lock: sha1:Tm4PmdPWxqjCGxY9OZLRsf1+p2I= sha256:xvUmBEFdl2dkNPwuCeG+loW+2sKONL2vv6H5oSdB8bA=
sha1:7OvDTLPUcx/af3x2C2/yKKuezUw= sha256:E4hiraknS8D8uI5eVgx+kH4jS+CG8SKhCh7bGf5Jh6Y=
In-Reply-To: <20230417103454.440dbdc0@wibble.sysadmininc.com>
 by: Julien ÉLIE - Mon, 17 Apr 2023 18:15 UTC

Hi Nigel,

> 1. How can I search the entire spool for a given phrase?
>
> 2. How can I search a given hierarchy recursively for a given phrase?
>
> I'm sure I'm not the first person who wants to do this, so hopefully
> someone has a solution.

I'm unfortunately not aware of such a tool for storage methods other
than tradspool. Even timehash won't help with the second point, as
articles are not classified by hierarchy.

If someone has ever written a script to do that, I would happily add it
to INN.

Otherwise, I would just suggest retrieving articles one by one from the
history file and parsing them.
To do that, take the last field of each line in the history file, and
give its value to "sm -q".

Example for the first article in the history file in pathdb:

% head -n1 history | cut -f3 | sm -q

You'll get the article on standard output. You could then grep it
for whatever you want.
You then need to iterate over each article.

Of course, a more complex script would be needed if your search has
several parameters (in a hierarchy, from someone, etc.).
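
A minimal sketch of that loop in Python. It assumes the storage token is
the third tab-separated field of a history line (as in the cut example
above), that tokens start with "@", and that sm is on the PATH. The
function names are hypothetical.

```python
import subprocess


def iter_tokens(history_lines):
    """Yield the storage token (third tab-separated field) of each line."""
    for line in history_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: lines without a third field have no stored article.
        if len(fields) >= 3 and fields[2].startswith("@"):
            yield fields[2]


def fetch_article(token):
    """Return the raw article for a token by running "sm -q" on it."""
    result = subprocess.run(["sm", "-q", token], stdout=subprocess.PIPE)
    return result.stdout


def search(history_lines, phrase, fetch=fetch_article):
    """Yield the tokens of articles whose raw text contains phrase (bytes)."""
    for token in iter_tokens(history_lines):
        if phrase in fetch(token):
            yield token


def search_history(history_path, phrase):
    """Search every article listed in the history file at history_path."""
    with open(history_path, errors="replace") as history:
        return list(search(history, phrase.encode()))
```

Starting one sm process per article is slow over millions of articles;
the sketch only illustrates the iteration. The fetch parameter lets a
faster retrieval function be swapped in.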

--
Julien ÉLIE

« Ira furor breuis est. » (Horace)

Re: working with cnfs

<20230417155045.21198a0b@wibble.sysadmininc.com>

Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!.POSTED.47.189.156.66!not-for-mail
From: sysop@endofthelinebbs.com (Nigel Reed)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 15:50:45 -0500
Organization: End Of The Line BBS
Message-ID: <20230417155045.21198a0b@wibble.sysadmininc.com>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com>
<u1k2cf$1ran5$1@news.trigofacile.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Injection-Info: www.sysadmininc.com; posting-host="47.189.156.66";
logging-data="398259"; mail-complaints-to="usenet@www.sysadmininc.com"
X-Newsreader: Claws Mail 4.1.1git47 (GTK 3.24.33; x86_64-pc-linux-gnu)
 by: Nigel Reed - Mon, 17 Apr 2023 20:50 UTC

On Mon, 17 Apr 2023 20:15:43 +0200
Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:

> Hi Nigel,
>
> > 1. How can I search the entire spool for a given phrase?
> >
> > 2. How can I search a given hierarchy recursively for a given
> > phrase?
> >
> > I'm sure I'm not the first person who wants to do this, so hopefully
> > someone has a solution.
>
> I'm unfortunately not aware of such a tool for storage methods other
> than tradspool. Even timehash won't help with the second point, as
> articles are not classified by hierarchy.
>
> If someone has ever written a script to do that, I would happily add
> it to INN.
>
> Otherwise, I would just suggest retrieving articles one by one from
> the history file and parsing them.
> To do that, take the last field of each line in the history file,
> and give its value to "sm -q".
>
> Example for the first article in the history file in pathdb:
>
> % head -n1 history | cut -f3 | sm -q
>
> You'll get the article on standard output. You could then grep it
> for whatever you want.
> You then need to iterate over each article.
>
> Of course, a more complex script would be needed if your search has
> several parameters (in a hierarchy, from someone, etc.).

At this point I have over 7 million articles. Do you have any idea how
long that is going to take? :)

Maybe I should have just got a new disk and formatted it with 3 times
the inodes!

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23

Re: working with cnfs

<u1ke7c$634$1$arnold@news.chmurka.net>

Path: i2pn2.org!i2pn.org!news.chmurka.net!.POSTED.s.v.chmurka.net!not-for-mail
From: gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 21:37:49 -0000 (UTC)
Organization: news.chmurka.net
Message-ID: <u1ke7c$634$1$arnold@news.chmurka.net>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com>
NNTP-Posting-Host: s.v.chmurka.net
Injection-Date: Mon, 17 Apr 2023 21:37:49 -0000 (UTC)
Injection-Info: news.chmurka.net; posting-account="arnold"; posting-host="s.v.chmurka.net:172.24.44.20";
logging-data="6244"; mail-complaints-to="abuse-news.(at).chmurka.net"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.32-v7+ (armv7l))
Cancel-Lock: sha1:NrwzRf6ECRfNecL81KHvV6CmsFA=
 by: Adam W. - Mon, 17 Apr 2023 21:37 UTC

Nigel Reed <sysop@endofthelinebbs.com> wrote:

> 1. How can I search the entire spool for a given phrase?
>
> 2. How can I search a given hierarchy recursively for a given phrase?

I don't have a direct answer (other than what Julien said), but if the
search phrase is always the same, then you could add another newsfeeds
entry and feed all new articles to the program with "Tp". This program
(script) could do the grep and, for example, send you an email, or post
to a private group, or whatever (~15 years ago I used to have a similar
setup, reposting all replies to my posts to a private group, accessible
only by me -- it was more convenient to reply to them this way).

Now I'm using it to count articles on Polish newsgroups to produce results
like these:

http://news.chmurka.net/top15.php

The line is:

chmurka.postprocessor.pl\
:!*,pl.*,alt.pl.*\
:Tp:/usr/local/news/local/bin/post-processor-pl.sh %s

Now that I think of it, such a grepping script could be done by
writing a program that reads the CNFS file (the format has to be described
somewhere, or can be deduced from the source code) and maybe dumps its
contents to stdout (or searches for a phrase and dumps the whole article),
but I don't know of anything like that.

On the other hand, CNFS is a binary format, but posts are stored in a text
format, so maybe something like this will suffice?

strings cnfs-file | grep phrase
grep -a phrase cnfs-file
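
A minimal Python sketch of the same raw scan as "grep -a", for cases
where the position of each hit is wanted. It maps the file read-only and
returns the byte offset of every occurrence of the phrase. The function
name is hypothetical, and nothing here interprets the CNFS layout: tying
an offset back to a particular article would need the buffer format.

```python
import mmap


def find_offsets(path, phrase):
    """Return the byte offsets at which phrase (bytes) occurs in the file."""
    offsets = []
    with open(path, "rb") as handle:
        # Map the file instead of reading it, so a multi-gigabyte buffer
        # is paged in by the operating system as the scan advances.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as buffer:
            position = buffer.find(phrase)
            while position != -1:
                offsets.append(position)
                position = buffer.find(phrase, position + 1)
    return offsets
```

Note that mmap refuses to map an empty file, so the function raises
ValueError for one.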

Re: working with cnfs

<u1kh69$2gu1l$2@news.xmission.com>

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: legalize+jeeves@mail.xmission.com (Richard)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 22:28:25 -0000 (UTC)
Organization: multi-cellular, biological
Sender: legalize+jeeves@mail.xmission.com
Message-ID: <u1kh69$2gu1l$2@news.xmission.com>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com> <u1k2cf$1ran5$1@news.trigofacile.com> <20230417155045.21198a0b@wibble.sysadmininc.com>
Reply-To: (Richard) legalize+jeeves@mail.xmission.com
Injection-Date: Mon, 17 Apr 2023 22:28:25 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:2607:fa18:0:beef::4";
logging-data="2652213"; mail-complaints-to="abuse@xmission.com"
X-Reply-Etiquette: No copy by email, please
Mail-Copies-To: never
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: legalize@shell.xmission.com (Richard)
 by: Richard - Mon, 17 Apr 2023 22:28 UTC

[Please do not mail me a copy of your followup]

Nigel Reed <sysop@endofthelinebbs.com> spake the secret code
<20230417155045.21198a0b@wibble.sysadmininc.com> thusly:

>At this point I have over 7 million articles. Do you have any idea how
>long that is going to take? :)

If this is something you do often, then I suggest you build some sort
of keyword index: use INN to automatically feed all incoming articles
into the indexer and, in the background, run the indexer over all your
existing articles to catch up on the older ones.

Then you would query your index for the keywords to find relevant
articles that might contain your whole phrase.
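
A minimal sketch of such a keyword index in Python, assuming articles
are identified by some id (for example a storage token) and that a
caller-supplied function can return an article's text for the final
phrase check. The class and its methods are illustrative, not an
existing indexer.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def words(text):
    """Split text into lowercase alphanumeric words."""
    return WORD.findall(text.lower())


class KeywordIndex:
    def __init__(self):
        # word -> set of ids of the articles that contain it
        self.postings = defaultdict(set)

    def add(self, article_id, text):
        """Index one article; call this for incoming and old articles alike."""
        for word in words(text):
            self.postings[word].add(article_id)

    def search(self, phrase, fetch):
        """Return sorted ids of articles containing the whole phrase.

        The index narrows the candidates to articles containing every
        word; fetch(article_id) then supplies the text to confirm that
        the words appear together, in order.
        """
        terms = words(phrase)
        if not terms:
            return []
        candidates = set.intersection(
            *(self.postings.get(term, set()) for term in terms))
        wanted = " " + " ".join(terms) + " "
        return sorted(
            article_id for article_id in candidates
            if wanted in " " + " ".join(words(fetch(article_id))) + " ")
```

Only the candidate articles are fetched at query time, which is what
avoids reading the whole spool for every search. A real index would live
on disk rather than in a dictionary.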
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Terminals Wiki <http://terminals-wiki.org>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Re: working with cnfs

<20230417173141.795d85a5@wibble.sysadmininc.com>

Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!.POSTED.47.189.156.66!not-for-mail
From: sysop@endofthelinebbs.com (Nigel Reed)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 17:31:41 -0500
Organization: End Of The Line BBS
Message-ID: <20230417173141.795d85a5@wibble.sysadmininc.com>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com>
<u1ke7c$634$1$arnold@news.chmurka.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: www.sysadmininc.com; posting-host="47.189.156.66";
logging-data="398259"; mail-complaints-to="usenet@www.sysadmininc.com"
X-Newsreader: Claws Mail 4.1.1git47 (GTK 3.24.33; x86_64-pc-linux-gnu)
 by: Nigel Reed - Mon, 17 Apr 2023 22:31 UTC

On Mon, 17 Apr 2023 21:37:49 -0000 (UTC)
gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) wrote:

> Nigel Reed <sysop@endofthelinebbs.com> wrote:
>
> > 1. How can I search the entire spool for a given phrase?
> >
> > 2. How can I search a given hierarchy recursively for a given
> > phrase?
>
> I don't have a direct answer (other than what Julien said), but if
> the search phrase is always the same, then you could add another
> newsfeeds entry and feed all new articles to the program with "Tp".
> This program (script) could do the grep and, for example, send you an
> email, or post to a private group, or whatever (~15 years ago I used
> to have a similar setup, reposting all replies to my posts to a
> private group, accessible only by me -- it was more convenient to
> reply to them this way).
>
> Now I'm using it to count articles on Polish newsgroups to produce
> results like these:
>
> http://news.chmurka.net/top15.php
>
> The line is:
>
> chmurka.postprocessor.pl\
> :!*,pl.*,alt.pl.*\
> :Tp:/usr/local/news/local/bin/post-processor-pl.sh %s
>
> Now that I think of it, such a grepping script could be done by
> writing a program that reads the CNFS file (the format has to be
> described somewhere, or can be deduced from the source code) and
> maybe dumps its contents to stdout (or searches for a phrase and
> dumps the whole article), but I don't know of anything like that.
>
> On the other hand, CNFS is a binary format, but posts are stored in a
> text format, so maybe something like this will suffice?
>
> strings cnfs-file | grep phrase
> grep -a phrase cnfs-file

Unfortunately the queries would be ad hoc. I looked at CPAN: 212,470
modules and nothing that works with CNFS.

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23

Re: working with cnfs

<u1kj64$4vj$1@nnrp.usenet.blueworldhosting.com>

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!nnrp.usenet.blueworldhosting.com!.POSTED!not-for-mail
From: jesse.rehmer@blueworldhosting.com (Jesse Rehmer)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 23:02:28 -0000 (UTC)
Organization: BlueWorld Hosting Usenet (https://usenet.blueworldhosting.com)
Message-ID: <u1kj64$4vj$1@nnrp.usenet.blueworldhosting.com>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com> <u1k2cf$1ran5$1@news.trigofacile.com> <20230417155045.21198a0b@wibble.sysadmininc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=fixed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 17 Apr 2023 23:02:28 -0000 (UTC)
Injection-Info: nnrp.usenet.blueworldhosting.com;
logging-data="5107"; mail-complaints-to="usenet@blueworldhosting.com"
User-Agent: Usenapp for MacOS
Cancel-Lock: sha1:crCc5EovO9Qxd+Ad2jIq6TAWDsI= sha256:lpq3BlG15XfEFwzcJMEAn5D+Yzk9BcRfpfI7kkYUGEg=
sha1:XZIbmIaXyaUYMyvr27/YsZnx4uE= sha256:4dgbSN/3wzHZiAQbaUvw6Bgv3agNW+tOgO1QOlCOx5Y=
X-Usenapp: v1.26.6/d - Full License
 by: Jesse Rehmer - Mon, 17 Apr 2023 23:02 UTC

On Apr 17, 2023 at 3:50:45 PM CDT, "Nigel Reed" <sysop@endofthelinebbs.com>
wrote:

> At this point I have over 7 million articles. Do you have any idea how
> long that is going to take? :)
>
> Maybe I should have just got a new disk and formatted it with 3 times
> the inodes!

I weighed the pros/cons of CNFS and the issue you describe was one that kept
me from using it on my main box. If you want to stick with tradspool I
recommend using ZFS. On a relatively small (>1TB) disk I have somewhere near
280,000,000 articles in tradspool.

Re: working with cnfs

<87pm82az6n.fsf@hope.eyrie.org>

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.killfile.org!news.eyrie.org!.POSTED!not-for-mail
From: eagle@eyrie.org (Russ Allbery)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 16:04:00 -0700
Organization: The Eyrie
Message-ID: <87pm82az6n.fsf@hope.eyrie.org>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com>
<u1k2cf$1ran5$1@news.trigofacile.com>
<20230417155045.21198a0b@wibble.sysadmininc.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: hope.eyrie.org;
logging-data="11284"; mail-complaints-to="news@eyrie.org"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:eZ4nNSRKN/vBjkXRp9H7bG+NWIM=
 by: Russ Allbery - Mon, 17 Apr 2023 23:04 UTC

Nigel Reed <sysop@endofthelinebbs.com> writes:

> At this point I have over 7 million articles. Do you have any idea how
> long that is going to take? :)

grep is probably somewhat more optimized than sm when reading files, but
searching 7 million articles is just going to be slow no matter how you
retrieve the article. sm is retrieving the article in mostly similar ways
(mmap) to how grep is retrieving it.

Searching 7 million articles without an index is just going to be slow no
matter how you do it. This is why when search is an anticipated
operation, people pre-create the search index.

(There have been multiple proposals for a search capability in NNTP and
ways to integrate search into INN over the years, but none of them have
stuck, in part because the open source search tools keep changing and
everyone stops using the old ones.)

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

Re: working with cnfs

<u1kku0$21ti$1@nnrp.usenet.blueworldhosting.com>

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!newsfeed.hasname.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!nnrp.usenet.blueworldhosting.com!.POSTED!not-for-mail
From: jesse.rehmer@blueworldhosting.com (Jesse Rehmer)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Mon, 17 Apr 2023 23:32:16 -0000 (UTC)
Organization: BlueWorld Hosting Usenet (https://usenet.blueworldhosting.com)
Message-ID: <u1kku0$21ti$1@nnrp.usenet.blueworldhosting.com>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com> <u1k2cf$1ran5$1@news.trigofacile.com> <20230417155045.21198a0b@wibble.sysadmininc.com> <87pm82az6n.fsf@hope.eyrie.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=fixed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 17 Apr 2023 23:32:16 -0000 (UTC)
Injection-Info: nnrp.usenet.blueworldhosting.com;
logging-data="67506"; mail-complaints-to="usenet@blueworldhosting.com"
User-Agent: Usenapp for MacOS
Cancel-Lock: sha1:765ws+IfXe1peerHnRxmXguN1a4= sha256:MUU0MdaTV8eScO4s56nwvJ0BEgPIPR3+Od9AVQVWPK8=
sha1:PTz9RkYu409xdpFRmjHzOfAQvHM= sha256:HAMm9QUuRJMzuSMMsamr7eOczqHsTLO9CRIe/HEaUmU=
X-Usenapp: v1.26.6/d - Full License
 by: Jesse Rehmer - Mon, 17 Apr 2023 23:32 UTC

On Apr 17, 2023 at 6:04:00 PM CDT, "Russ Allbery" <eagle@eyrie.org> wrote:

> Nigel Reed <sysop@endofthelinebbs.com> writes:
>
>> At this point I have over 7 million articles. Do you have any idea how
>> long that is going to take? :)
>
> grep is probably somewhat more optimized than sm when reading files, but
> searching 7 million articles is just going to be slow no matter how you
> retrieve the article. sm is retrieving the article in mostly similar ways
> (mmap) to how grep is retrieving it.
>
> Searching 7 million articles without an index is just going to be slow no
> matter how you do it. This is why when search is an anticipated
> operation, people pre-create the search index.
>
> (There have been multiple proposals for a search capability in NNTP and
> ways to integrate search into INN over the years, but none of them have
> stuck, in part because the open source search tools keep changing and
> everyone stops using the old ones.)

The scope is beyond me, but if anyone out there wants a large dataset to feed
into something like ElasticSearch, I can provide it (and probably the
hosting/infrastructure needs).

Re: working with cnfs

<u1lsmm$nuk$1$arnold@news.chmurka.net>

Path: i2pn2.org!i2pn.org!news.chmurka.net!.POSTED.s.v.chmurka.net!not-for-mail
From: gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Tue, 18 Apr 2023 10:51:02 -0000 (UTC)
Organization: news.chmurka.net
Message-ID: <u1lsmm$nuk$1$arnold@news.chmurka.net>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com> <u1ke7c$634$1$arnold@news.chmurka.net> <20230417173141.795d85a5@wibble.sysadmininc.com>
NNTP-Posting-Host: s.v.chmurka.net
Injection-Date: Tue, 18 Apr 2023 10:51:02 -0000 (UTC)
Injection-Info: news.chmurka.net; posting-account="arnold"; posting-host="s.v.chmurka.net:172.24.44.20";
logging-data="24532"; mail-complaints-to="abuse-news.(at).chmurka.net"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.32-v7+ (armv7l))
Cancel-Lock: sha1:BQ5gxcL/C6iwE7JiOmpYH86afkY=
 by: Adam W. - Tue, 18 Apr 2023 10:51 UTC

Nigel Reed <sysop@endofthelinebbs.com> wrote:

> Unfortunately the queries would be ad hoc. I looked at CPAN: 212,470
> modules and nothing that works with CNFS.

If you're into Perl, you could look at the source of cnfsstat. It parses
the buffers directly.

I'm looking into include/inn/storage.h (it's used by storage manager,
frontends/sm.c). Seems there's a nice API to retrieve all articles one by
one:

ARTHANDLE *SMnext(ARTHANDLE *article, const RETRTYPE amount);

There's also a manual page in doc/man/libinnstorage.3 (man libinnstorage).
If you want to write something to retrieve articles from CNFS (or, in
general, from storage manager), you could start there.

"The SMnext function is similar in function to SMretrieve except that it
is intended for traversing the method's article store sequentially. To
start a query, SMnext should be called with a NULL pointer ARTHANDLE.
Then SMnext returns ARTHANDLE which should be used for the next query. If
a NULL pointer ARTHANDLE is returned, no articles are left to be queried.
If data of ARTHANDLE is NULL pointer or len of ARTHANDLE is 0, it
indicates the article may be corrupted and should be cancelled by
SMcancel. The data area indicated by ARTHANDLE should not be modified."

There's also some overview search function, but it's not for article
bodies, so it probably won't be useful.

One could write a program that retrieves articles one by one and performs
some operations on them -- for example, matching them with regex (there's
a regexec API for that, declared in regex.h) and if regex matches,
printing some information (storage token, or Message-ID, or whatever).
ARTHANDLE structure (defined in include/inn/storage.h) contains all
required data, including storage token.
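
A minimal sketch of that matching step, in Python with the re module
rather than in C with SMnext and regexec. The source of the articles is
left abstract: any iterable of (id, raw article) pairs will do. It also
shows one way to restrict the search to a hierarchy through the
Newsgroups header field. Folded header lines are not handled, and the
function names are hypothetical.

```python
import re


def split_article(raw):
    """Split a raw article (bytes) into (headers, body) at the first blank line."""
    text = raw.replace(b"\r\n", b"\n")
    head, _, body = text.partition(b"\n\n")
    return head, body


def newsgroups(head):
    """Return the group names listed in the Newsgroups header field."""
    for line in head.split(b"\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"newsgroups":
            return [group.strip().decode("ascii", "replace")
                    for group in value.split(b",") if group.strip()]
    return []


def in_hierarchy(groups, hierarchy):
    """True if any group is the hierarchy itself or lies below it."""
    return any(group == hierarchy or group.startswith(hierarchy + ".")
               for group in groups)


def grep_articles(articles, pattern, hierarchy=None):
    """Yield ids of articles whose body matches pattern (a bytes regex)."""
    regex = re.compile(pattern)
    for article_id, raw in articles:
        head, body = split_article(raw)
        if hierarchy is not None and not in_hierarchy(newsgroups(head), hierarchy):
            continue
        if regex.search(body):
            yield article_id
```

For example, grep_articles(articles, rb"(?i)cnfs", hierarchy="news")
yields only the matching articles posted somewhere under news.*.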

Anyway, if you want to do a text search, it would be best to build a
dictionary first (for example, linking each word to the tokens of the
articles that contain it), but it would be better to use a ready-made
indexer for that (there has to be one)...

Re: working with cnfs

<20230418115617.19d08fb2@wibble.sysadmininc.com>

Path: i2pn2.org!i2pn.org!newsfeed.endofthelinebbs.com!.POSTED.47.189.156.66!not-for-mail
From: sysop@endofthelinebbs.com (Nigel Reed)
Newsgroups: news.software.nntp
Subject: Re: working with cnfs
Date: Tue, 18 Apr 2023 11:56:17 -0500
Organization: End Of The Line BBS
Message-ID: <20230418115617.19d08fb2@wibble.sysadmininc.com>
References: <20230417103454.440dbdc0@wibble.sysadmininc.com>
<u1ke7c$634$1$arnold@news.chmurka.net>
<20230417173141.795d85a5@wibble.sysadmininc.com>
<u1lsmm$nuk$1$arnold@news.chmurka.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: www.sysadmininc.com; posting-host="47.189.156.66";
logging-data="569381"; mail-complaints-to="usenet@www.sysadmininc.com"
X-Newsreader: Claws Mail 4.1.1git47 (GTK 3.24.33; x86_64-pc-linux-gnu)
 by: Nigel Reed - Tue, 18 Apr 2023 16:56 UTC

On Tue, 18 Apr 2023 10:51:02 -0000 (UTC)
gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) wrote:

>
> Anyway, if you want to do a text search, it would be best to build
> a dictionary first (for example, linking each word to the tokens of
> the articles that contain it), but it would be better to use a
> ready-made indexer for that (there has to be one)...

I think you're right. I set up my CNFS files to be 5 GB each, and it
took about 10 minutes to grep for my name in one of them when piping the
output of strings, so I don't see any other way than creating an index
of some kind.

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23
