computers / news.software.nntp / Re: Encoding madness

Subject  Author
* Encoding madness  Nigel Reed
+* Re: Encoding madness  Richard Kettlewell
|+- Re: Encoding madness  Franck
|+* Re: Encoding madness  Adam H. Kerman
||`* Re: Encoding madness  Olivier Miakinen
|| `- Re: Encoding madness  Adam H. Kerman
|`- Re: Encoding madness  Richard
+* Re: Encoding madness  Franck
|`- Re: Encoding madness  Franck
+* Re: Encoding madness  Julien ÉLIE
|+- Re: Encoding madness  Franck
|+* Re: Encoding madness  Adam W.
||`* Re: Encoding madness  Russ Allbery
|| `* Re: Encoding madness  Adam W.
||  `* Re: Encoding madness  Adam H. Kerman
||   `* Re: Encoding madness  Adam W.
||    `* Re: Encoding madness  Adam H. Kerman
||     `* Re: Encoding madness  Russ Allbery
||      `* Re: Encoding madness  Adam W.
||       +* Re: Encoding madness  Russ Allbery
||       |`- Re: Encoding madness  Urs Janßen
||       `- Re: Encoding madness  Michael Bäuerle
|+* Re: Encoding madness  Nigel Reed
||`- Re: Encoding madness  Tom Furie
|`- Re: Encoding madness  Billy G. (go-while)
+* Re: Encoding madness  news
|+* Re: Encoding madness  Julien ÉLIE
||+* Re: Encoding madness  Adam H. Kerman
|||`* Re: Encoding madness  Julien ÉLIE
||| `* Re: Encoding madness  Russ Allbery
|||  `* Re: Encoding madness  Adam H. Kerman
|||   `* Re: Encoding madness  Russ Allbery
|||    `* Re: Encoding madness  Adam H. Kerman
|||     `* Re: Encoding madness  Russ Allbery
|||      `- Re: Encoding madness  Adam H. Kerman
||`- Re: Encoding madness  Olivier Miakinen
|`- Re: Encoding madness  Russ Allbery
+* Re: Encoding madness  Russ Allbery
|`* Re: Encoding madness  Adam H. Kerman
| `* Re: Encoding madness  Russ Allbery
|  `* Re: Encoding madness  Adam H. Kerman
|   `- Re: Encoding madness  Julien ÉLIE
`- Re: Encoding madness  Julien ÉLIE

Re: Encoding madness

<87fs95oy9h.fsf@hope.eyrie.org>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1651&group=news.software.nntp#1651

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!paganini.bofh.team!news.killfile.org!news.eyrie.org!.POSTED!not-for-mail
From: eagle@eyrie.org (Russ Allbery)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Wed, 12 Apr 2023 09:34:18 -0700
Organization: The Eyrie
Message-ID: <87fs95oy9h.fsf@hope.eyrie.org>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<u14jqj$296$1$arnold@news.chmurka.net> <u14mq3$2nkma$1@dont-email.me>
<u166re$nl1$1$arnold@news.chmurka.net> <u168us$31asn$2@dont-email.me>
<871qkpqhyy.fsf@hope.eyrie.org> <u16idd$e96$1$arnold@news.chmurka.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: hope.eyrie.org;
logging-data="28110"; mail-complaints-to="news@eyrie.org"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:+GQXDFzkyM86bty+vQDCwydVr/Y=
 by: Russ Allbery - Wed, 12 Apr 2023 16:34 UTC

gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) writes:
> Russ Allbery <eagle@eyrie.org> wrote:

>> German has a standard scheme,

> Do you mean substituting umlauts with their Latin equivalents and adding
> "e"?

> ä = ae
> ö = oe
> ü = ue

> At least that's what I found:

Yeah, exactly.

I know there's a similar one for Scandinavian languages that uses
characters like { and } to stand in for characters that don't exist in
ASCII (I think because those keys on an English keyboard were in the same
location as the real letters on a Scandinavian keyboard), but this is now
obscure enough that my Google skills are failing me. Old-timers would
probably still recognize that encoding, but I think everyone just uses
UTF-8 now.
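
For reference, the brace scheme recalled above looks like the 7-bit national variants of ISO 646, in which a handful of ASCII code points were reassigned to national letters. A minimal Python sketch, assuming the classic Swedish/Finnish and Danish/Norwegian assignments; the table and function names are invented for illustration and are not taken from any document cited in this thread:

```python
# Sketch: read text written in a 7-bit ISO 646 national variant.
# Assumption: only the six positions shown here are remapped; real
# variants can differ in a few more positions.
ISO646_SE = str.maketrans("[\\]{|}", "ÄÖÅäöå")      # Swedish/Finnish
ISO646_DK_NO = str.maketrans("[\\]{|}", "ÆØÅæøå")   # Danish/Norwegian


def decode_national(text: str, table: dict) -> str:
    """Replace the reassigned ASCII punctuation with national letters."""
    return text.translate(table)


print(decode_national("R{ksm|rg}s", ISO646_SE))          # Räksmörgås
print(decode_national("bl}b{rsyltet|j", ISO646_DK_NO))   # blåbærsyltetøj
```

The mapping is ambiguous in the other direction: a literal "{" in source code is indistinguishable from "ä", which is one reason such schemes faded.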

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

Re: Encoding madness

<u166re$nl1$1$arnold@news.chmurka.net>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1652&group=news.software.nntp#1652

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!weretis.net!feeder8.news.weretis.net!news.chmurka.net!.POSTED.s.v.chmurka.net!not-for-mail
From: gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Wed, 12 Apr 2023 12:06:06 -0000 (UTC)
Organization: news.chmurka.net
Message-ID: <u166re$nl1$1$arnold@news.chmurka.net>
References: <20230411014437.0aef1026@wibble.sysadmininc.com> <u13te4$ae2$1$arnold@news.chmurka.net> <87bkjuwd27.fsf@hope.eyrie.org> <u14jqj$296$1$arnold@news.chmurka.net> <u14mq3$2nkma$1@dont-email.me>
NNTP-Posting-Host: s.v.chmurka.net
Injection-Date: Wed, 12 Apr 2023 12:06:06 -0000 (UTC)
Injection-Info: news.chmurka.net; posting-account="arnold"; posting-host="s.v.chmurka.net:172.24.44.20";
logging-data="24225"; mail-complaints-to="abuse-news.(at).chmurka.net"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.32-v7+ (armv7l))
Cancel-Lock: sha1:Pg4P/5uQ78ShmlR0dXlKzztVWiE=
 by: Adam W. - Wed, 12 Apr 2023 12:06 UTC

Adam H. Kerman <ahk@chinet.com> wrote:

>>Well, to be honest, you lose some information, but it's very rare and can
>>usually be deduced from context.
>
> In a language that doesn't use the Latin alphabet? C'mon.

No, I'm only talking about Polish.

Re: Encoding madness

<AABkNvAMqVoAAAdh.A3.flnews@WStation5.stz-e.de>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1653&group=news.software.nntp#1653

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!news.mixmin.net!newsreader4.netcologne.de!news.netcologne.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: michael.baeuerle@stz-e.de (Michael Bäuerle)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Wed, 12 Apr 2023 19:53:16 +0200 (CEST)
Lines: 30
Message-ID: <AABkNvAMqVoAAAdh.A3.flnews@WStation5.stz-e.de>
References: <20230411014437.0aef1026@wibble.sysadmininc.com> <u14jqj$296$1$arnold@news.chmurka.net> <u14mq3$2nkma$1@dont-email.me> <u166re$nl1$1$arnold@news.chmurka.net> <u168us$31asn$2@dont-email.me> <871qkpqhyy.fsf@hope.eyrie.org> <u16idd$e96$1$arnold@news.chmurka.net>
Reply-To: Michael Bäuerle <michael.baeuerle@gmx.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit
X-Trace: individual.net EfNfP9N3SmtOAAz7yHKtsgH3e63Wz5J3OguPIwmO6eFcEoweXh
X-Orig-Path: not-for-mail
Cancel-Lock: sha1:1YWEEF1/hRCrD6ZumH9gbN+3RgE= sha256:8gyx2kWoP3zrLuPDIS87nPIQHyO57E4oRATfpIiZgHU= sha1:Arpc4JHWRakoxXey/N/14qVPwOg=
Injection-Date: Wed, 12 Apr 2023 17:53:16 -0000
User-Agent: flnews/1.2.0pre21 (for NetBSD)
 by: Michael Bäuerle - Wed, 12 Apr 2023 17:53 UTC

Adam W. wrote:
>
> [German has a standard scheme]
> ä = ae
> ö = oe
> ü = ue

This can be used in all cases for German.

Same for capital umlauts:

Ä = Ae (or AE)
Ö = Oe (or OE)
Ü = Ue (or UE)

> At least that's what I found:
>
> https://blogs.transparent.com/german/writing-the-letters-%E2%80%9Ca%E2%80%9D-%E2%80%9Co%E2%80%9D-and-%E2%80%9Cu%E2%80%9D-without-a-german-keyboard/

| Bräuche – Braeuche (costumes) and Bäuche – Baeuche (bellies)
                      ^^^^^^^^
This is wrong. "Bräuche" means something like "conventions".
My dictionary says "customs" (which sounds a bit like "costumes").

> I also know that their ß (scharfes S) can be substituted with ss.

There are some cases in which this alters the meaning (sometimes "sz"
is used for them). Normally this is no problem if context is available.
If in doubt, use "ss".
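
The substitution rules collected in this post can be sketched as a small translation table. This is only an illustration in Python; the names GERMAN_ASCII and german_to_ascii are made up for the example, and the title-case forms (Ae, Oe, Ue) are assumed for capitals:

```python
# Sketch: ASCII transliteration of German umlauts and sharp s,
# following the scheme discussed in this thread.
GERMAN_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",   # all-caps text would want AE/OE/UE
    "ß": "ss",                          # lossy, see the last example
})


def german_to_ascii(text: str) -> str:
    """Return text with umlauts and sharp s replaced by ASCII digraphs."""
    return text.translate(GERMAN_ASCII)


print(german_to_ascii("Bräuche"))   # Braeuche
print(german_to_ascii("Bäuche"))    # Baeuche
# "Maße" (measurements) and "Masse" (mass) collapse to the same string:
print(german_to_ascii("Maße") == german_to_ascii("Masse"))   # True
```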

Re: Encoding madness

<u168t2$31asn$1@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1654&group=news.software.nntp#1654

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: ahk@chinet.com (Adam H. Kerman)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Wed, 12 Apr 2023 12:41:07 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 12
Message-ID: <u168t2$31asn$1@dont-email.me>
References: <20230411014437.0aef1026@wibble.sysadmininc.com> <1681245447.bystand@zzo38computer.org> <u15pee$1aneb$1@news.trigofacile.com>
Injection-Date: Wed, 12 Apr 2023 12:41:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="595da4453ea6f81ea9a07e161a9c1d7b";
logging-data="3189655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+LiUp1gY/88e79CKAq9UTvH2nzBzYLm1c="
Cancel-Lock: sha1:3PiuUWmzvO8n2iATObGg0nGzetU=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
 by: Adam H. Kerman - Wed, 12 Apr 2023 12:41 UTC

Julien <iulius@nom-de-mon-site.com.invalid> wrote:

>>. . .

>FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
>can use any encoding you want for them.

The newgroup or checkgroups messages could have MIME headers specifying
the character set but these won't survive processing, so a big text file
will have multiple unspecified encodings. Aargh.

Just stating the obvious here.

Re: Encoding madness

<u168us$31asn$2@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1655&group=news.software.nntp#1655

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: ahk@chinet.com (Adam H. Kerman)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Wed, 12 Apr 2023 12:42:04 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <u168us$31asn$2@dont-email.me>
References: <20230411014437.0aef1026@wibble.sysadmininc.com> <u14jqj$296$1$arnold@news.chmurka.net> <u14mq3$2nkma$1@dont-email.me> <u166re$nl1$1$arnold@news.chmurka.net>
Injection-Date: Wed, 12 Apr 2023 12:42:04 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="595da4453ea6f81ea9a07e161a9c1d7b";
logging-data="3189655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19fAbEkcNRUqorGVrBwZwt5UOU8lhJ6YZo="
Cancel-Lock: sha1:5DV0556p9+PFVctDObs0Mh8sLH4=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
 by: Adam H. Kerman - Wed, 12 Apr 2023 12:42 UTC

Adam W. <gof-cut-this-news@cut-this-chmurka.net.invalid> wrote:
>Adam H. Kerman <ahk@chinet.com> wrote:

>>>Well, to be honest, you lose some information, but it's very rare and can
>>>usually be deduced from context.

>>In a language that doesn't use the Latin alphabet? C'mon.

>No, I'm only talking about Polish.

You are. Russ wasn't.

Re: Encoding madness

<u17e6q$jeu$1@nntp.de>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1656&group=news.software.nntp#1656

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!2.eu.feeder.erje.net!feeder.erje.net!nntp.de!.POSTED.akk21-int.akk.kit.edu!not-for-mail
From: urs@buil.tin.org (Urs Janßen)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Wed, 12 Apr 2023 23:17:46 -0000 (UTC)
Organization: tin.org
Archive: no
Message-ID: <u17e6q$jeu$1@nntp.de>
References: <20230411014437.0aef1026@wibble.sysadmininc.com> <u14jqj$296$1$arnold@news.chmurka.net> <u14mq3$2nkma$1@dont-email.me> <u166re$nl1$1$arnold@news.chmurka.net> <u168us$31asn$2@dont-email.me> <871qkpqhyy.fsf@hope.eyrie.org> <u16idd$e96$1$arnold@news.chmurka.net> <87fs95oy9h.fsf@hope.eyrie.org>
Injection-Date: Wed, 12 Apr 2023 23:17:46 -0000 (UTC)
Injection-Info: nntp.de; posting-host="akk21-int.akk.kit.edu:2a00:1398:5:f602:cafe:cafe:cafe:21";
logging-data="19934"; mail-complaints-to="abuse@nntp.de"
User-Agent: tin/2.6.3-20230217 ("Pittyvaich") (Linux/4.19.0-23-amd64 (x86_64))
Cancel-Lock: sha1:W3gxuZez7fAd+be2GjQyQy5RBgI=
X-No-Archive: yes
X-No-HTML: yes
 by: Urs Janßen - Wed, 12 Apr 2023 23:17 UTC

In <87fs95oy9h.fsf@hope.eyrie.org> on Wed, 12 Apr 2023 18:34:18,
Russ Allbery wrote:
> I know there's a similar one for Scandinavian languages that uses
> characters like { and } to stand in for characters that don't exist in
> ASCII (I think because those keys on an English keyboard were in the same
> location as the real letters on a Scandinavian keyboard), but this is now
> obscure enough that my Google skills are failing me. Old-timers would
> probably still recognize that encoding, but I think everyone just uses
> UTF-8 now.

JFTR, see "Table 3" from http://bzr.tin.org/doc/iso2asc.txt

Re: Encoding madness

<u170sp$1bci$1@cabale.usenet-fr.net>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1657&group=news.software.nntp#1657

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!news.gegeweb.eu!gegeweb.org!usenet-fr.net!.POSTED!not-for-mail
From: om+news@miakinen.net (Olivier Miakinen)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Wed, 12 Apr 2023 21:30:33 +0200
Organization: There's no cabale
Lines: 27
Message-ID: <u170sp$1bci$1@cabale.usenet-fr.net>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<1681245447.bystand@zzo38computer.org> <u15pee$1aneb$1@news.trigofacile.com>
NNTP-Posting-Host: 200.89.28.93.rev.sfr.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8bit
X-Trace: cabale.usenet-fr.net 1681327833 44434 93.28.89.200 (12 Apr 2023 19:30:33 GMT)
X-Complaints-To: abuse@usenet-fr.net
NNTP-Posting-Date: Wed, 12 Apr 2023 19:30:33 +0000 (UTC)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
Firefox/52.0 SeaMonkey/2.49.4
In-Reply-To: <u15pee$1aneb$1@news.trigofacile.com>
 by: Olivier Miakinen - Wed, 12 Apr 2023 19:30 UTC

On 12/04/2023 10:17, Julien ÉLIE wrote:
>
> We need an interoperable way to provide texts.
> Please note RFC 2277 (BCP 18) about charsets:
>
> Protocols MUST be able to use the UTF-8 charset, which consists of
> the ISO 10646 coded character set combined with the UTF-8 character
> encoding scheme, as defined in [10646] Annex R (published in
> Amendment 2), for all text.
>
> Protocols MAY specify, in addition, how to use other charsets or
> other character encoding schemes for ISO 10646, such as UTF-16, but
> lack of an ability to use UTF-8 is a violation of this policy; such a
> violation would need a variance procedure ([BCP9] section 9) with
> clear and solid justification in the protocol specification document
> before being entered into or advanced upon the standards track.
>
> For existing protocols or protocols that move data from existing
> datastores, support of other charsets, or even using a default other
> than UTF-8, may be a requirement. This is acceptable, but UTF-8
> support MUST be possible.

And RFC 2277 is a quarter of a century old (January 1998).

--
Olivier Miakinen

Re: Encoding madness

<u18980$1cs7j$1@news.trigofacile.com>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1658&group=news.software.nntp#1658

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.trigofacile.com!.POSTED.176-143-2-105.abo.bbox.fr!not-for-mail
From: iulius@nom-de-mon-site.com.invalid (Julien ÉLIE)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 08:59:12 +0200
Organization: Groupes francophones par TrigoFACILE
Message-ID: <u18980$1cs7j$1@news.trigofacile.com>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<87fs96wd4j.fsf@hope.eyrie.org> <u145ij$2mbgj$1@dont-email.me>
<877cuiw68y.fsf@hope.eyrie.org> <u14ecu$2mso6$5@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 13 Apr 2023 06:59:12 -0000 (UTC)
Injection-Info: news.trigofacile.com; posting-account="julien"; posting-host="176-143-2-105.abo.bbox.fr:176.143.2.105";
logging-data="1470707"; mail-complaints-to="abuse@trigofacile.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.9.1
Cancel-Lock: sha1:65OktNgOH009AcB4EwbIz804dhI= sha256:R1i5sDUHihcj/z+5HR3d1wBzNIEDbkY5mq7jQYEieeY=
sha1:68MDqktd3AH9uPIM5LcE5B3VjHk= sha256:J2OYcoK3TOV/vmIXUkat0C/6h1bFHt53RHlgJ8Qul1E=
In-Reply-To: <u14ecu$2mso6$5@dont-email.me>
 by: Julien ÉLIE - Thu, 13 Apr 2023 06:59 UTC

Hi Adam,

>> The goal of all of that machinery is that the hierarchy administrators
>> should be canonical for the newsgroups entries for their hierarchy.
>> Encoding is one of those things where we need to standardize in order to,
>> say, comply with the NNTP standard, but I'm not willing to make any other
>> editorial judgments because it gets into too much annoying work. So this
>> is something you should take up with the hierarchy administrators.
>
> I apologize for suggesting additional programming work for you. I change
> my request to asking for an amendment to your README in which you might
> urge a proponent or hierarchy administrator not to use UTF-8 punctuation
> for which ASCII punctuation would suffice, to avoid needlessly turning a
> description into UTF-8.

Wouldn't a 100% ASCII-encoded file fit your needs?
I've just generated this one with the Text::Unidecode Perl module:
http://usenet.trigofacile.com/hierarchies/data/newsgroups.ascii

Punctuation such as French quotation marks («»), non-breaking spaces, etc.
is converted into ASCII, as is of course every other non-ASCII character.

ftp.isc.org could then make available both files (.utf8 and .ascii).
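
For readers without Perl at hand, the idea of folding text down to ASCII can be approximated with Python's standard library. This is a rough stand-in and not a port of Text::Unidecode: the punctuation table is a small hand-picked assumption, and anything it cannot map becomes "?".

```python
import unicodedata

# Assumption: a tiny hand-picked table; Text::Unidecode covers far more.
PUNCT = str.maketrans({
    "«": '"', "»": '"',
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # no-break space
    "\u2026": "...",                # ellipsis
})


def to_ascii(text: str) -> str:
    """Fold text to ASCII; characters with no mapping become '?'."""
    text = text.translate(PUNCT)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.encode("ascii", "replace").decode("ascii")


print(to_ascii("«\u00a0Forum généraliste\u00a0»"))   # " Forum generaliste "
```

Accent stripping via NFKD works for Latin letters with diacritics, but scripts such as Cyrillic or Han have no decomposition to ASCII and would need a real transliteration table.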

--
Julien ÉLIE

« They turned down an offer from a Norman?!? » (Astérix)

Re: Encoding madness

<u189ou$1cs7i$1@news.trigofacile.com>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1659&group=news.software.nntp#1659

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.trigofacile.com!.POSTED.176.143-2-105.abo.bbox.fr!not-for-mail
From: iulius@nom-de-mon-site.com.invalid (Julien ÉLIE)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 09:08:14 +0200
Organization: Groupes francophones par TrigoFACILE
Message-ID: <u189ou$1cs7i$1@news.trigofacile.com>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<1681245447.bystand@zzo38computer.org> <u15pee$1aneb$1@news.trigofacile.com>
<u168t2$31asn$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 13 Apr 2023 07:08:14 -0000 (UTC)
Injection-Info: news.trigofacile.com; posting-account="julien"; posting-host="176.143-2-105.abo.bbox.fr:176.143.2.105";
logging-data="1470706"; mail-complaints-to="abuse@trigofacile.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.9.1
Cancel-Lock: sha1:lDUBNHb+nGXxgImmjnQMJo333FM= sha256:2ekFRRoRw/kKSPsInmZ/2ilCci9fvdeS9/x5I2Np0uU=
sha1:pahQ0M/H1ITekL14ESma7sQ/Oqo= sha256:dHOyWIlvMcz8SBrBWsdGTtrJ2BHJ3DtFYy7HVpZ6qPE=
In-Reply-To: <u168t2$31asn$1@dont-email.me>
 by: Julien ÉLIE - Thu, 13 Apr 2023 07:08 UTC

Hi Adam,

>> FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
>> can use any encoding you want for them.
>
> The newgroup or checkgroups messages could have MIME headers specifying
> the character set but these won't survive processing, so a big text file
> will have multiple unspecified encodings. Aargh.

My sentence was not about the processing of control messages (for which
the encoding in the MIME headers is correctly parsed, and the descriptions
are actually converted to UTF-8 for consistency).
The descriptions of newsgroups for which control articles are sent end
up in UTF-8.

Besides, there's a /localencoding/ setting in control.ctl to
set the resulting encoding. The default is UTF-8, but one may
change it to another encoding if desired.
https://www.eyrie.org/~eagle/software/inn/docs/control.ctl.html

My sentence was just about the encoding of the newsgroups file; INN will
provide its contents as-is when asked for the descriptions. If it
has multiple unspecified encodings (big5, iso-8859-xx, utf8, cp1252...),
it will provide them as-is. It won't try to convert them on the fly.
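
The recoding step described for control messages can be illustrated with a few lines of Python. This is a sketch of the general idea only, not INN's code: recode_description is a hypothetical helper, and the group name and description are invented.

```python
def recode_description(raw: bytes, declared_charset: str,
                       local_encoding: str = "utf-8") -> bytes:
    """Decode bytes using the declared charset, re-encode for local storage.

    Raises UnicodeDecodeError if the bytes do not match the declared
    charset, and LookupError if the charset name is unknown.
    """
    text = raw.decode(declared_charset)
    return text.encode(local_encoding)


# A description as it might arrive in an ISO-8859-1 control message.
latin1_line = "example.test\tDescription accentuée".encode("iso-8859-1")
print(recode_description(latin1_line, "iso-8859-1"))
# b'example.test\tDescription accentu\xc3\xa9e'
```

A newsgroups file edited by hand bypasses any such step, which is how mixed encodings can end up being served as-is.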

--
Julien ÉLIE

« – Let's leave him our chariot and take his…
– Yes, that will tide us over… » (Astérix)

Re: Encoding madness

<874jpjx1d5.fsf@hope.eyrie.org>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1660&group=news.software.nntp#1660

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.killfile.org!news.eyrie.org!.POSTED!not-for-mail
From: eagle@eyrie.org (Russ Allbery)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 08:12:22 -0700
Organization: The Eyrie
Message-ID: <874jpjx1d5.fsf@hope.eyrie.org>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<1681245447.bystand@zzo38computer.org>
<u15pee$1aneb$1@news.trigofacile.com> <u168t2$31asn$1@dont-email.me>
<u189ou$1cs7i$1@news.trigofacile.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: hope.eyrie.org;
logging-data="17842"; mail-complaints-to="news@eyrie.org"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:7SlaBR+5K9IqdHp0Nn+4WvXU26s=
 by: Russ Allbery - Thu, 13 Apr 2023 15:12 UTC

Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> writes:

> My sentence was just about the encoding of the newsgroups file; INN will
> provide its contents as-is when being requested the descriptions. If it
> has multiple unspecified encodings (big5, iso-8859-xx, utf8, cp1252...),
> it will provide them as-is. It won't try to convert them on the fly.

The fundamental protocol problem here is that the LIST NEWSGROUPS command
has no way to convey an encoding, let alone a different encoding for every
line. It's all well and good for different hierarchies to use different
encodings and use appropriate MIME headers for their control messages to
convey that encoding; all of that would in theory work as expected. But,
at the end of that process, a given news server returns the whole thing in
response to LIST NEWSGROUPS, and it has to pick a single encoding for that
response.

It's not even about the storage, really. Yes, right now INN uses a single
big file, but it doesn't need to do that. In theory, it could use some
smarter storage mechanism that preserved the original encoding. But that
doesn't help because of the protocol; it still has to respond to LIST
NEWSGROUPS commands, and at that point the separate encodings don't help.

The only workable choices for a single encoding are ASCII and UTF-8;
everything else is much worse in terms of interoperability. ASCII is not
generally sufficient as soon as one gets too far from western Europe and,
truly, is not really sufficient for western European languages either;
while it may be possible to read French with stripped accent marks or
Spanish without tildes, it's annoying, sometimes ambiguous, and there's no
reason to put up with it in 2023. Hence UTF-8.
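
To make the single-encoding constraint concrete, here is a small Python sketch of a response whose lines were stored in different encodings. The group names and descriptions are invented, and the byte string only imitates the shape of a LIST NEWSGROUPS response:

```python
# Two description lines, each stored in a different encoding.
lines = [
    "example.fr\tCafé".encode("iso-8859-1"),
    "example.de\tGrüße".encode("utf-8"),
]
response = b"\r\n".join(lines)

# The client has no per-line charset, so it must pick one for everything.
try:
    response.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)         # valid UTF-8: False

as_latin1 = response.decode("iso-8859-1")
print("Café" in as_latin1)             # True: the French line survives
print("Grüße" in as_latin1)            # False: the German line is mojibake
```

Whichever charset the client guesses, at least one line comes out wrong, which is the protocol-level problem described above.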

Given that, working backwards, sending hierarchy control messages in a
different encoding than UTF-8 (or ASCII, which is a UTF-8 subset) is
probably not the best approach. Even if the news software understands the
MIME headers properly and knows the encoding (which can be dubious), now
the content has to be recoded into UTF-8 by the news server anyway. While
this is a well-defined operation for most encodings, it adds another step
that can fail and another opportunity for something to go wrong.

The best results are likely to come from using UTF-8 end-to-end. This
also has the advantage of being the direction that computing is going
anyway. My understanding is that even Chinese domestic use is
increasingly UTF-8, although support for other encodings is still required
in some situations. (Chinese was a potential sticking point due to issues
with how Chinese, Japanese, and Korean were encoded in Unicode that's more
complex than is worth getting into.)

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

Re: Encoding madness

<87zg7bvm1j.fsf@hope.eyrie.org>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1661&group=news.software.nntp#1661

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.killfile.org!news.eyrie.org!.POSTED!not-for-mail
From: eagle@eyrie.org (Russ Allbery)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 08:28:40 -0700
Organization: The Eyrie
Message-ID: <87zg7bvm1j.fsf@hope.eyrie.org>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<1681245447.bystand@zzo38computer.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: hope.eyrie.org;
logging-data="17842"; mail-complaints-to="news@eyrie.org"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:U1yAPwdCP3vZ61gmTcaYNx/kXCg=
 by: Russ Allbery - Thu, 13 Apr 2023 15:28 UTC

news@zzo38computer.org.invalid writes:

> My opinion is that newsgroup names should be purely ASCII (there are
> many benefits to this, and using non-ASCII characters in newsgroup names
> and domain names and commands and configuration files can cause many
> problems, including security issues (especially if any Unicode-based
> encoding is used; non-Unicode has less security issues, but still is not
> worth it to use non-ASCII in these cases), comparisons, input, etc).

It's very easy for someone who speaks English to say that newsgroup names
should be purely ASCII, but what we would be saying is that people who
only speak Japanese (or Chinese, or Russian, or...) should put up with
newsgroup names being opaque, incomprehensible blobs of foreign
characters. Imagine what Usenet would be like for you if every newsgroup
name was in Arabic (or, if you happen to read Arabic, Korean, or some
other language).

Historically, that is exactly what we have said. But I think that's sad.

The security problem is real, but honestly that's largely because Usenet
software is very old and is often written in languages that, if not
actually dying, are at least very stagnant. Handling encodings properly
in C is a pain, but that's because doing anything properly in C is a pain.
Every modern language comes with extremely well-tested libraries, and most
of them now make *not* dealing with Unicode very difficult; it just
happens automatically. The remaining non-coding problems are mostly about
homograph attacks, and that's not much of an issue with newsgroup names.

Using multiple encodings, as you say, definitely makes the problem worse,
since you can't simply reject all invalid UTF-8 very early on, since you
may instead be dealing with ISO-8859-1 or some other encoding.
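
As an illustration of rejecting invalid UTF-8 early, a strict decode is usually all that is needed. A minimal Python sketch; the function name is_valid_utf8 is made up for the example:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("日本語".encode("utf-8")))       # True
print(is_valid_utf8("café".encode("iso-8859-1")))    # False
print(is_valid_utf8(b"\xc0\xaf"))                    # False (overlong form)
```

With several permitted encodings this check stops being decisive, because a byte sequence that fails here might still be legitimate ISO-8859-1.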
Thankfully, there's no real reason to support anything other than UTF-8
now. The remaining question is whether Usenet software can cope in
practice, or whether, like DNS and email, we'll be forced into using
complicated ASCII-compatible encoding schemes. Experiments so far seemed
to indicate that native Usenet software support for UTF-8 newsgroup names
wasn't that bad.
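
For comparison, this is roughly what the ASCII-compatible encodings used by DNS and email look like, shown with Python's standard library. The domain and header text are invented examples, and nothing here is specific to Usenet software:

```python
from email.header import Header, decode_header

# DNS: non-ASCII labels travel as Punycode behind an "xn--" prefix (IDNA).
ace = "bücher.example".encode("idna")
print(ace)                   # b'xn--bcher-kva.example'
print(ace.decode("idna"))    # bücher.example

# Email: RFC 2047 encoded-words wrap non-ASCII header text in pure ASCII.
encoded = Header("Grüße", "utf-8").encode()
print(encoded.isascii())     # True
raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))   # Grüße
```

Both schemes keep the wire format ASCII at the cost of names that are unreadable without decoding, which is the complication native UTF-8 newsgroup names would avoid.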

I can't think of any other major Internet protocol, not even domain names,
that is still limited to ASCII. Newsgroup names are a sad outlier.

> (But, I really hate Unicode; it is full of problems, including Han
> unification and other complications; and it is a stateful character set
> even though the encoding is stateless. TRON character code is better in
> some ways (especially for Japanese text), and I have done some work
> using this.)

I hate the email message format (it should be something much less
ambiguous and machine-parsable), the RFC 2822 Date format, and RFC 2047
header encoding. The price of implementing protocols is that there will
always be parts of them you don't like because life is compromise.

> The RFC says that it should be UTF-8, but I think that this is a mistake
> in the design of the protocol.

Mistake or not, it's not going to change now.

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

Re: Encoding madness

<u19bvf$127sd$4@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1662&group=news.software.nntp#1662

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ahk@chinet.com (Adam H. Kerman)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 16:52:00 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <u19bvf$127sd$4@dont-email.me>
References: <20230411014437.0aef1026@wibble.sysadmininc.com> <u168t2$31asn$1@dont-email.me> <u189ou$1cs7i$1@news.trigofacile.com> <874jpjx1d5.fsf@hope.eyrie.org>
Injection-Date: Thu, 13 Apr 2023 16:52:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e209180b4b1fb5c5735b4c45c4131c69";
logging-data="1122189"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19/vRaYI+8T/PvaV1dMaPpJbCri7HLYYMA="
Cancel-Lock: sha1:+v+5AikdeETFthatCLsZfuF9iUE=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
 by: Adam H. Kerman - Thu, 13 Apr 2023 16:52 UTC

Russ Allbery <eagle@eyrie.org> wrote:

>. . . (Chinese was a potential sticking point due to issues
>with how Chinese, Japanese, and Korean were encoded in Unicode that's more
>complex than is worth getting into.)

Are you talking about character codes for the glyphs common to all three
languages, the CJK set, or something else entirely? Also, wasn't there
something about China modernizing glyphs that were still being used by
the other two languages?

Re: Encoding madness

<87leivvfgi.fsf@hope.eyrie.org>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1663&group=news.software.nntp#1663

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.killfile.org!news.eyrie.org!.POSTED!not-for-mail
From: eagle@eyrie.org (Russ Allbery)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 10:50:53 -0700
Organization: The Eyrie
Message-ID: <87leivvfgi.fsf@hope.eyrie.org>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<u168t2$31asn$1@dont-email.me> <u189ou$1cs7i$1@news.trigofacile.com>
<874jpjx1d5.fsf@hope.eyrie.org> <u19bvf$127sd$4@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: hope.eyrie.org;
logging-data="17842"; mail-complaints-to="news@eyrie.org"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:leO9R1tuK+M8FckF7l+ciiy1UnM=
 by: Russ Allbery - Thu, 13 Apr 2023 17:50 UTC

"Adam H. Kerman" <ahk@chinet.com> writes:
> Russ Allbery <eagle@eyrie.org> wrote:

>> . . . (Chinese was a potential sticking point due to issues with how
>> Chinese, Japanese, and Korean were encoded in Unicode that's more
>> complex than is worth getting into.)

> Are you talking about character codes for the glyphs common to all three
> languages, the CJK set, or something else entirely? Also, wasn't there
> something about China modernizing glyphs that were still being used by
> the other two languages?

Yeah, I'm talking about the glyph unification problem. I forget how much
impact the traditional vs. simplified Chinese distinction has on the
Unicode encoding and whether some of those distinctions are also unified.
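
A quick way to poke at this question is to look at code points directly. The Python sketch below uses three characters chosen purely as examples; it shows one traditional/simplified pair that has separate code points and one character that is commonly cited as unified, and it does not settle how the distinction is handled in general:

```python
import unicodedata

# Traditional and simplified forms of "country": two distinct code points.
for ch in "國国":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+570B CJK UNIFIED IDEOGRAPH-570B
# U+56FD CJK UNIFIED IDEOGRAPH-56FD

# A unified character: a single code point whose customary glyph differs
# between Chinese and Japanese typography. Plain text cannot say which
# form is meant; that is left to fonts or language tagging.
print(f"U+{ord('直'):04X}")   # U+76F4
```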

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

Re: Encoding madness

<u19hq8$13b43$1@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1664&group=news.software.nntp#1664

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ahk@chinet.com (Adam H. Kerman)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 18:31:36 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 31
Message-ID: <u19hq8$13b43$1@dont-email.me>
References: <20230411014437.0aef1026@wibble.sysadmininc.com> <874jpjx1d5.fsf@hope.eyrie.org> <u19bvf$127sd$4@dont-email.me> <87leivvfgi.fsf@hope.eyrie.org>
Injection-Date: Thu, 13 Apr 2023 18:31:36 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e209180b4b1fb5c5735b4c45c4131c69";
logging-data="1158275"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/V4+vytn+VdXG6XRcqwFxOPo7dn2xDHLU="
Cancel-Lock: sha1:3Ad+llv86zeVFE37gjlBr3Xhyj8=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
 by: Adam H. Kerman - Thu, 13 Apr 2023 18:31 UTC

Russ Allbery <eagle@eyrie.org> wrote:
>"Adam H. Kerman" <ahk@chinet.com> writes:
>>Russ Allbery <eagle@eyrie.org> wrote:

>>>. . . (Chinese was a potential sticking point due to issues with how
>>>Chinese, Japanese, and Korean were encoded in Unicode that's more
>>>complex than is worth getting into.)

>>Are you talking about character codes for the glyphs common to all three
>>languages, the CJK set, or something else entirely? Also, wasn't there
>>something about China modernizing glyphs that were still being used by
>>the other two languages?

>Yeah, I'm talking about the glyph unification problem. I forget how much
>impact the traditional vs. simplified Chinese distinction has on the
>Unicode encoding and whether some of those distinctions are also unified.

I didn't sit in on those meetings years ago like you, but the little I
know about Chinese is that the traditional glyphs represented words and
not letters; they represent letters in the other two languages.

I'm aware that the glyphs are combinations of strokes that are common to
other glyphs, and I often wondered if the strokes themselves and not the
final result should have been what was encoded. I don't know if in
handwriting the student was taught to draw the strokes in a certain order,
since order becomes important in representing the combined strokes for
the glyph.

I realize strokes aren't letter-equivalents in other languages.

The coding plane would have been a hell of a lot smaller.

Re: Encoding madness

<87fs93vcql.fsf@hope.eyrie.org>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1665&group=news.software.nntp#1665

Newsgroups: news.software.nntp
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.killfile.org!news.eyrie.org!.POSTED!not-for-mail
From: eagle@eyrie.org (Russ Allbery)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 11:49:38 -0700
Organization: The Eyrie
Message-ID: <87fs93vcql.fsf@hope.eyrie.org>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<874jpjx1d5.fsf@hope.eyrie.org> <u19bvf$127sd$4@dont-email.me>
<87leivvfgi.fsf@hope.eyrie.org> <u19hq8$13b43$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: hope.eyrie.org;
logging-data="17842"; mail-complaints-to="news@eyrie.org"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:WP4ASJ8MBMxnQywHaOpoKCO+s30=
 by: Russ Allbery - Thu, 13 Apr 2023 18:49 UTC

"Adam H. Kerman" <ahk@chinet.com> writes:

> I didn't sit in on those meetings years ago like you, but the little I
> know about Chinese is that the traditional glyphs represented words and
> not letters; they represent letters in the other two languages.

I'm fairly sure this isn't true in general. To the extent that the same
basic glyphs are used in Japanese kanji, I believe that they are also
words, or at least not letters in the sense of the Latin alphabet.
Japanese *kana* uses some Chinese characters to represent syllables
instead of words, but kana is a supplemental writing system used in
addition to kanji.

(Disclaimer that I do not speak or read any of these languages. I just
have a long-standing amateur interest in character sets.)

Hangul for Korean is different, but I don't think Hangul characters were
unified with Chinese and Japanese. I believe the impact on Korean was on
hanja, which is not used for most words. Hangul doesn't look anything
like Chinese or Japanese characters, to such an extent that I, as someone
who doesn't know any of these languages, can distinguish between Hangul
and the other languages on sight.

> I'm aware that the glyphs are combinations of strokes that are common to
> other glyphs, and I often wondered if the strokes themselves and not the
> final result should have been what was encoded.

There was a fairly extensive discussion of this at the time, but they
decided against it for a bunch of reasons that I don't remember. I think
one of them was that the existing encodings of those languages did not do
this, and one of the goals of Unicode was to allow easy conversion from
and to existing character encodings.

> The coding plane would have been a hell of a lot smaller.

Yes, but the software would have been a hell of a lot more complicated,
and it's not clear that's a good tradeoff. Arabic is already a
substantial challenge to support, and its combining characters are much
simpler than the system that would be required for stroke encoding, IIRC.

(Admittedly, most of the challenge with Arabic is the right-to-left
directionality.)

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

Re: Encoding madness

<u19kgp$13mhu$1@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1666&group=news.software.nntp#1666

Newsgroups: news.software.nntp
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ahk@chinet.com (Adam H. Kerman)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Thu, 13 Apr 2023 19:17:45 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 7
Message-ID: <u19kgp$13mhu$1@dont-email.me>
References: <20230411014437.0aef1026@wibble.sysadmininc.com> <87leivvfgi.fsf@hope.eyrie.org> <u19hq8$13b43$1@dont-email.me> <87fs93vcql.fsf@hope.eyrie.org>
Injection-Date: Thu, 13 Apr 2023 19:17:45 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e209180b4b1fb5c5735b4c45c4131c69";
logging-data="1169982"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18gNexObL6QD389dLnFmHaIQWwTtt3cnAE="
Cancel-Lock: sha1:KxgaOC0rjZvvgGvwbF5gOWyNLKQ=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
 by: Adam H. Kerman - Thu, 13 Apr 2023 19:17 UTC

Russ Allbery <eagle@eyrie.org> wrote:

>>. . .

>I'm fairly sure this isn't true in general. . . .

All right; I'll look it up. Thanks.

Re: Encoding madness

<u1hnbm$1p02g$1@news.trigofacile.com>

https://www.rocksolidbbs.com/computers/article-flat.php?id=1670&group=news.software.nntp#1670

Newsgroups: news.software.nntp
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.trigofacile.com!.POSTED.176.143-2-105.abo.bbox.fr!not-for-mail
From: iulius@nom-de-mon-site.com.invalid (Julien ÉLIE)
Newsgroups: news.software.nntp
Subject: Re: Encoding madness
Date: Sun, 16 Apr 2023 22:55:18 +0200
Organization: Groupes francophones par TrigoFACILE
Message-ID: <u1hnbm$1p02g$1@news.trigofacile.com>
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 16 Apr 2023 20:55:18 -0000 (UTC)
Injection-Info: news.trigofacile.com; posting-account="julien"; posting-host="176.143-2-105.abo.bbox.fr:176.143.2.105";
logging-data="1867856"; mail-complaints-to="abuse@trigofacile.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.10.0
Cancel-Lock: sha1:DzEUzRLXUUJh4CdWLFpsTWv23gc= sha256:wF6+yjjIvTU3pjbcveW5TfhcVjpMRoByhkcXTI+H/Yw=
sha1:79MNSmwNshtjN7s3BTO5wx7kxZE= sha256:vmRcoZUWc/mkoyAYD3SAkEjJhbOuW8xFjTpL8DElR2k=
In-Reply-To: <20230411014437.0aef1026@wibble.sysadmininc.com>
 by: Julien ÉLIE - Sun, 16 Apr 2023 20:55 UTC

Hi Nigel,

> I'm trying to sync up the active and newsgroups file from 15 peers and
> it's proving to be a bit of a challenge.

Apart from encoding issues, are there special cases that you would have
liked to achieve for your sync and merge?

We have 2 old scripts needing a bit of refresh. I had planned to have a
look at them for the INN 2.7.2 release (in late 2023 or 2024).
https://github.com/InterNetNews/inn/issues/39

# mkngfile - make a newsgroup description file from multiple sources
#
# Jeremy Nixon <jeremy@exit109.com>
# $Id: mkngfile,v 1.1 1999/04/17 09:19:25 jeremy Exp $
#
# This program creates a newsgroup description file, using one or
# several input files containing group descriptions. The resulting
# file will contain a description line for each group in your active
# file.
#
# If the input contains multiple different descriptions for a group,
# the program will prompt interactively for which one to use; or, if
# the --noask option is given, one will be chosen arbitrarily. If a
# group has no description, $default_desc (below) will be used.
#
# The output will be sent to stdout, or to the file specified with
# the --output (or -o) option.
#
# Example - to run with your existing newsgroups file, a local copy
# of the ISC newsgroups file, and a directory containing checkgroups
# files with names like *.check, creating the new file as 'newfile':
# mkngfile -o newfile /news/etc/newsgroups newsgroups checkgroups/*.check
#
# You can set the location of your active file below so you don't
# have to specify it on the command line.

=> Besides files as input, I would also add the possibility to sync from
hostnames (the program will then download their newsgroups files).

We'll also need a similar tool to merge several active files (note that
INN already has the actmerge utility, without any documentation, that
merges 2 active files).
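
A minimal sketch of what merging several active files could look like. It assumes the usual four-field active format (group name, high mark, low mark, flag) and a hypothetical "first file that lists a group wins" policy. It is not based on how actmerge behaves, and the function names are invented for illustration:

```python
def parse_active(text):
    """Parse active-file text into {group: (high, low, flag)}."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name, high, low, flag = fields
        entries.setdefault(name, (high, low, flag))
    return entries


def merge_active(*texts):
    """Merge several active files; the first file listing a group wins.

    Assumption: this policy is only an example, a real tool would
    probably let the operator choose how conflicts are resolved.
    """
    merged = {}
    for text in texts:
        for name, entry in parse_active(text).items():
            merged.setdefault(name, entry)
    return [f"{name} {' '.join(merged[name])}" for name in sorted(merged)]
```

The article numbers of a peer mean nothing locally, so a real merge would likely reset the high and low marks rather than copy them as done here.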

Descriptions are then cleaned with:

# cleannewsgroups.pl
# Copyright 1997-1999 Arthur Hagen
# Remove duplicate (Moderated) comments
# Strip trailing spaces
# Keep only one description for a newsgroup
# Option to remove extra tabs or to pretty-print with several tabs
# Option to either sort the newsgroups file alphabetically or to have it
# in the same order as the active file.

It can warn when the encoding is not UTF-8 :-)
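
To illustrate the kind of cleaning listed above, here is a rough Python sketch. It is not a port of cleannewsgroups.pl; the function name, the "first description wins" rule and the alphabetical output order are assumptions made for the example:

```python
import re


def clean_newsgroups(raw_lines):
    """Return (cleaned_lines, not_utf8) from an iterable of bytes lines."""
    seen = {}
    not_utf8 = []
    for raw in raw_lines:
        try:
            line = raw.decode("utf-8")
        except UnicodeDecodeError:
            not_utf8.append(raw)  # report the line instead of guessing
            continue
        line = line.rstrip()  # strip trailing spaces and the newline
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        name = parts[0]
        desc = parts[1] if len(parts) > 1 else ""
        # Collapse repeated "(Moderated)" markers into one trailing marker.
        if "(Moderated)" in desc:
            desc = re.sub(r"\s*\(Moderated\)", "", desc).strip()
            desc = (desc + " (Moderated)").strip()
        seen.setdefault(name, desc)  # keep only the first description
    cleaned = [f"{name}\t{desc}" for name, desc in sorted(seen.items())]
    return cleaned, not_utf8
```

Lines that are not valid UTF-8 are returned separately so that the caller can print a warning or try another charset.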

> The first bit is done, which is mainly getting rid of groups that have
> invalid names (those that end in a period, contain illegal characters,
> and the like).

Such checks can also be added to the script (with an option).
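
A possible shape for such a check, as a sketch only. The grammar is an assumption modelled on the newsgroup-name syntax of RFC 5536 (dot-separated components made of letters, digits, "+", "-" and "_"); a real option might be stricter, for instance by requiring lowercase:

```python
import re

# Assumed grammar: one or more non-empty components separated by single dots.
_COMPONENT = r"[A-Za-z0-9+_-]+"
_NAME_RE = re.compile(rf"{_COMPONENT}(?:\.{_COMPONENT})*")


def is_valid_group_name(name):
    """Return True if name matches the assumed newsgroup-name grammar."""
    return _NAME_RE.fullmatch(name) is not None
```

With this pattern, a name ending in a period, a name with an empty component ("alt..test") or a name containing non-ASCII characters is rejected.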

--
Julien ÉLIE

« Si, si, si… Avec des si, on mettrait Lutèce en amphore ! » (Vacancier)

Re: Encoding madness

<cNSPM.194530$Flj7.153650@fx13.ams4>

https://www.rocksolidbbs.com/computers/article-flat.php?id=2154&group=news.software.nntp#2154

Newsgroups: news.software.nntp
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer03.ams4!peer.am4.highwinds-media.com!news.highwinds-media.com!fx13.ams4.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: NoZilla/3.11 (Hackint; Unicorn; rv:0.8.15) go-while/19720229
NewsRW/4.2.0
Subject: Re: Encoding madness
Newsgroups: news.software.nntp
References: <20230411014437.0aef1026@wibble.sysadmininc.com>
<u13bpj$188ar$3@news.trigofacile.com>
Content-Language: en-US
From: no-reply@no.spam (Billy G. (go-while))
Organization: github.com/go-while
In-Reply-To: <u13bpj$188ar$3@news.trigofacile.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 18
Message-ID: <cNSPM.194530$Flj7.153650@fx13.ams4>
X-Complaints-To: abuse@blocknews.net
NNTP-Posting-Date: Sun, 24 Sep 2023 08:52:24 UTC
Date: Sun, 24 Sep 2023 10:56:31 +0200
X-Received-Bytes: 1367
 by: Billy G. (go-while) - Sun, 24 Sep 2023 08:56 UTC

On 11.04.23 12:12, Julien ÉLIE wrote:
> FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
> file are here:
>   http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8
>
> It may facilitate your life :-)
>
> The conversions I found out to work are:
> - cn.* and han.* are encoded in gb18030;
> - fido7.*, medlux.* and relcom.* in koi8-r;
> - ukr.* in koi8-u;
> - nctu.*, ncu.* and tw.* in big5;
> - scout.forum.chinese and scout.forum.korean in big5;
> - eternal-september.*, fido.* and fr.* in utf-8;
> - all the others fit well in cp1252.

thanks!
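
For anyone wanting to apply the quoted table in a script, a minimal sketch follows. The hierarchy-to-charset pairs are copied from the list above; the function names and the use of Python codec names are assumptions of this example, not part of any existing tool:

```python
import fnmatch

# (pattern, Python codec) pairs, checked in order; from the quoted list.
CHARSET_RULES = [
    ("cn.*", "gb18030"), ("han.*", "gb18030"),
    ("fido7.*", "koi8_r"), ("medlux.*", "koi8_r"), ("relcom.*", "koi8_r"),
    ("ukr.*", "koi8_u"),
    ("nctu.*", "big5"), ("ncu.*", "big5"), ("tw.*", "big5"),
    ("scout.forum.chinese", "big5"), ("scout.forum.korean", "big5"),
    ("eternal-september.*", "utf-8"), ("fido.*", "utf-8"), ("fr.*", "utf-8"),
]
DEFAULT_CHARSET = "cp1252"  # "all the others fit well in cp1252"


def charset_for(group):
    """Pick the charset assumed for a group's description."""
    for pattern, codec in CHARSET_RULES:
        if fnmatch.fnmatchcase(group, pattern):
            return codec
    return DEFAULT_CHARSET


def decode_line(raw):
    """Decode one raw newsgroups line (bytes) to str using the table."""
    parts = raw.split(None, 1)
    if not parts:
        return ""
    group = parts[0].decode("ascii", errors="replace")
    # Undecodable bytes become U+FFFD rather than aborting the run.
    return raw.decode(charset_for(group), errors="replace")
```

Writing the result back out with `.encode("utf-8")` then gives a newsgroups file in a single encoding.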
