
Problem with accented characters in mailbox.Maildir()

<fqlhij-fgft1.ln1@esprimo.zbmc.eu>

From: cl@isbd.net (Chris Green)
Newsgroups: comp.lang.python
Subject: Problem with accented characters in mailbox.Maildir()
Date: Sat, 6 May 2023 11:13:03 +0100
 by: Chris Green - Sat, 6 May 2023 10:13 UTC

I have a custom mail filter in python that uses the mailbox package to
open a mail message and give me access to the headers.

So I have the following code to open each mail message:-

#
#
# Read the message from standard input and make a message object from it
#
msg = mailbox.MaildirMessage(sys.stdin.buffer.read())

and then later I have (among many other bits and pieces):-

#
#
# test for string in Subject:
#
if searchTxt in str(msg.get("subject", "unknown")):
    do
    various
    things

This works exactly as intended most of the time but occasionally a
message whose subject should match the test is missed. I have just
realised when this happens, it's when the Subject: has accented
characters in it (this is from a mailing list about canals in France).

So, for example, the latest case of this happening has:-

Subject: aka Marne à la Saône (Waterways Continental Europe)

where the searchTxt in the code above is "Waterways Continental Europe".

Is there any way I can work round this issue? E.g. is there a way to
strip out all extended characters from a string? Or maybe it's
msg.get() that isn't managing to handle the accented string correctly?

Yes, I know that accented characters probably aren't allowed in
Subject: but I'm not going to get that changed! :-)

--
Chris Green

Re: Problem with accented characters in mailbox.Maildir()

<u35e88$2r8nd$1@dont-email.me>

From: nospam@please.ty (jak)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Sat, 6 May 2023 13:38:49 +0200
 by: jak - Sat, 6 May 2023 11:38 UTC

Chris Green wrote:
> I have a custom mail filter in python that uses the mailbox package to
> open a mail message and give me access to the headers.
>
> So I have the following code to open each mail message:-
>
> #
> #
> # Read the message from standard input and make a message object from it
> #
> msg = mailbox.MaildirMessage(sys.stdin.buffer.read())
>
> and then later I have (among many other bits and pieces):-
>
> #
> #
> # test for string in Subject:
> #
> if searchTxt in str(msg.get("subject", "unknown")):
>     do
>     various
>     things
>
>
> This works exactly as intended most of the time but occasionally a
> message whose subject should match the test is missed. I have just
> realised when this happens, it's when the Subject: has accented
> characters in it (this is from a mailing list about canals in France).
>
> So, for example, the latest case of this happening has:-
>
> Subject: aka Marne à la Saône (Waterways Continental Europe)
>
> where the searchTxt in the code above is "Waterways Continental Europe".
>
>
> Is there any way I can work round this issue? E.g. is there a way to
> strip out all extended characters from a string? Or maybe it's
> msg.get() that isn't managing to handle the accented string correctly?
>
> Yes, I know that accented characters probably aren't allowed in
> Subject: but I'm not going to get that changed! :-)
>
>

Hi,
you could try extracting the "Content-Type:charset" and then using it
for subject conversion:

subj = str(raw_subj, encoding='...')

Re: Problem with accented characters in mailbox.Maildir()

<9jrhij-o0st1.ln1@esprimo.zbmc.eu>

From: cl@isbd.net (Chris Green)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Sat, 6 May 2023 12:51:37 +0100
 by: Chris Green - Sat, 6 May 2023 11:51 UTC

A bit more information, msg.get("subject", "unknown") does return a
string, as follows:-

Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=

So it's the 'searchTxt in msg.get("subject", "unknown")' that's
failing. I.e. for some reason 'in' isn't working when the searched
string has utf-8 characters.

Surely there's a way to handle this.

--
Chris Green

Re: Problem with accented characters in mailbox.Maildir()

<vquhij-tn2u1.ln1@esprimo.zbmc.eu>

From: cl@isbd.net (Chris Green)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Sat, 6 May 2023 13:46:55 +0100
 by: Chris Green - Sat, 6 May 2023 12:46 UTC

Chris Green <cl@isbd.net> wrote:
> A bit more information, msg.get("subject", "unknown") does return a
> string, as follows:-
>
> Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
>
> So it's the 'searchTxt in msg.get("subject", "unknown")' that's
> failing. I.e. for some reason 'in' isn't working when the searched
> string has utf-8 characters.
>
> Surely there's a way to handle this.
>
.... and of course I now see the issue! The Subject: with utf-8
characters in it gets spaces changed to underscores. So searching for
'(Waterways Continental Europe)' fails.

I'll either need to test for both versions of the string or I'll need
to change underscores to spaces in the Subject: returned by msg.get().
It's a long enough string that I'm searching for that I won't get any
false positives.

Sorry for the noise everyone, it's a typical case of explaining the
problem shows one how to fix it! :-)
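
To make the effect described above concrete, here is a minimal sketch that uses only the subject quoted in this thread. It shows why the plain `in` test misses, and that replacing underscores with spaces is enough for this particular search string while the accented characters stay encoded. The variable names are made up for illustration.

```python
# Raw header value exactly as quoted earlier in the thread.
raw_subj = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
search_txt = "Waterways Continental Europe"

# The raw value is an RFC 2047 "encoded word": spaces are written as "_",
# so the search text with real spaces is not a substring of it.
found_raw = search_txt in raw_subj

# The workaround from the post: turn underscores back into spaces first.
patched = raw_subj.replace("_", " ")
found_patched = search_txt in patched

# The accented characters are still in their =XX form after the replace.
still_encoded = "=C3=A0" in patched

print(found_raw, found_patched, still_encoded)
```

The replace only helps as long as the text being searched for is plain ASCII and the header is Q-encoded; it does nothing for base64-encoded subjects.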

--
Chris Green

Re: Problem with accented characters in mailbox.Maildir()

<u35o3o$2sp4t$1@dont-email.me>

From: nospam@please.ty (jak)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Sat, 6 May 2023 16:27:04 +0200
 by: jak - Sat, 6 May 2023 14:27 UTC

Chris Green wrote:
> Chris Green <cl@isbd.net> wrote:
>> A bit more information, msg.get("subject", "unknown") does return a
>> string, as follows:-
>>
>> Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
>>
>> So it's the 'searchTxt in msg.get("subject", "unknown")' that's
>> failing. I.e. for some reason 'in' isn't working when the searched
>> string has utf-8 characters.
>>
>> Surely there's a way to handle this.
>>
> ... and of course I now see the issue! The Subject: with utf-8
> characters in it gets spaces changed to underscores. So searching for
> '(Waterways Continental Europe)' fails.
>
> I'll either need to test for both versions of the string or I'll need
> to change underscores to spaces in the Subject: returned by msg.get().
> It's a long enough string that I'm searching for that I won't get any
> false positives.
>
>
> Sorry for the noise everyone, it's a typical case of explaining the
> problem shows one how to fix it! :-)
>

This is probably what you need:

import email.header

raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='

subj = email.header.decode_header(raw_subj)[0]

subj[0].decode(subj[1])

'aka Marne à la Saône (Waterways Continental Europe)'

Re: Problem with accented characters in mailbox.Maildir()

<mailman.36.1683571574.13552.python-list@python.org>

From: hjp-python@hjp.at (Peter J. Holzer)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Mon, 8 May 2023 20:36:18 +0200
 by: Peter J. Holzer - Mon, 8 May 2023 18:36 UTC

On 2023-05-06 16:27:04 +0200, jak wrote:
> Chris Green wrote:
> > Chris Green <cl@isbd.net> wrote:
> > > A bit more information, msg.get("subject", "unknown") does return a
> > > string, as follows:-
> > >
> > > Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
[...]
> > ... and of course I now see the issue! The Subject: with utf-8
> > characters in it gets spaces changed to underscores. So searching for
> > '(Waterways Continental Europe)' fails.
> >
> > I'll either need to test for both versions of the string or I'll need
> > to change underscores to spaces in the Subject: returned by msg.get().

You need to decode the Subject properly. Unfortunately the Python email
module doesn't do that for you automatically. But it does provide the
necessary tools. Don't roll your own unless you've read and understood
the relevant RFCs.
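
As a side note, a hedged sketch of one of those tools: as far as I know, the legacy parsing behaviour leaves the header encoded, while parsing with `email.policy.default` decodes RFC 2047 headers when they are accessed. The sample message below is made up (only the Subject line comes from this thread), and it uses `email.message_from_bytes` rather than `mailbox.MaildirMessage`, which is an assumption about what would fit the OP's filter.

```python
import email
import email.policy

# A tiny made-up message; only the Subject line comes from this thread.
raw_message = (
    b"From: someone@example.com\n"
    b"Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=\n"
    b"\n"
    b"body\n"
)

# Legacy behaviour (compat32 policy): the header comes back still encoded.
legacy_msg = email.message_from_bytes(raw_message)
legacy_subject = str(legacy_msg["subject"])

# With email.policy.default the header value is decoded on access.
modern_msg = email.message_from_bytes(raw_message, policy=email.policy.default)
modern_subject = str(modern_msg["subject"])

print(legacy_subject)
print(modern_subject)
```

Whether switching the policy is acceptable depends on the rest of the filter, since the newer policy also changes how other headers are returned.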

>
> This is probably what you need:
>
> import email.header
>
> raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='
>
> subj = email.header.decode_header(raw_subj)[0]
>
> subj[0].decode(subj[1])
>
> 'aka Marne à la Saône (Waterways Continental Europe)'

You are on the right track, but that works only because the example
consists only of a single encoded word. This is not always the case (and
indeed not what the RFC recommends).

email.header.decode_header returns a *list* of chunks and you have to
process and concatenate all of them.
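
A small sketch of what that looks like in practice. The subject below is hypothetical (plain text mixed with one encoded word), and falling back to "us-ascii" for chunks without a charset is an assumption made for this example.

```python
import email.header

# Hypothetical subject mixing plain text with one encoded word.
raw_subj = "Re: =?utf-8?Q?Marne_=C3=A0_la_Sa=C3=B4ne?= (Waterways Continental Europe)"

chunks = email.header.decode_header(raw_subj)

# Each chunk is a (value, charset) pair; charset is None for unencoded text.
parts = []
for value, charset in chunks:
    if isinstance(value, bytes):
        parts.append(value.decode(charset or "us-ascii"))
    else:
        parts.append(value)
subject = "".join(parts)

print(len(chunks))
print(subject)
```

Taking only the first chunk here would lose the rest of the subject, and its charset is None, so it cannot be passed to decode() as an encoding name.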

Here is a snippet from a mail to html converter I wrote a few years ago:

import email.header

def decode_rfc2047(s):
    if s is None:
        return None
    r = ""
    # decode_header() yields (value, charset) pairs
    for chunk in email.header.decode_header(s):
        if chunk[1]:
            try:
                r += chunk[0].decode(chunk[1])
            except (LookupError, UnicodeDecodeError):
                # unknown charset or undecodable bytes: fall back
                r += chunk[0].decode("windows-1252")
        elif isinstance(chunk[0], bytes):
            r += chunk[0].decode('us-ascii')
        else:
            r += chunk[0]
    return r

(this is maybe a bit more forgiving than the OP needs, but I had to deal
with malformed mails)

I do have to say that Python is extraordinarily clumsy in this regard.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Re: Problem with accented characters in mailbox.Maildir()

<u3bo0t$3voum$1@dont-email.me>

From: nospam@please.ty (jak)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Mon, 8 May 2023 23:02:18 +0200
 by: jak - Mon, 8 May 2023 21:02 UTC

Peter J. Holzer wrote:
> On 2023-05-06 16:27:04 +0200, jak wrote:
>> Chris Green wrote:
>>> Chris Green <cl@isbd.net> wrote:
>>>> A bit more information, msg.get("subject", "unknown") does return a
>>>> string, as follows:-
>>>>
>>>> Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> [...]
>>> ... and of course I now see the issue! The Subject: with utf-8
>>> characters in it gets spaces changed to underscores. So searching for
>>> '(Waterways Continental Europe)' fails.
>>>
>>> I'll either need to test for both versions of the string or I'll need
>>> to change underscores to spaces in the Subject: returned by msg.get().
>
> You need to decode the Subject properly. Unfortunately the Python email
> module doesn't do that for you automatically. But it does provide the
> necessary tools. Don't roll your own unless you've read and understood
> the relevant RFCs.
>
>>
>> This is probably what you need:
>>
>> import email.header
>>
>> raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='
>>
>> subj = email.header.decode_header(raw_subj)[0]
>>
>> subj[0].decode(subj[1])
>>
>> 'aka Marne à la Saône (Waterways Continental Europe)'
>
> You are on the right track, but that works only because the example
> consists only of a single encoded word. This is not always the case (and
> indeed not what the RFC recommends).
>
> email.header.decode_header returns a *list* of chunks and you have to
> process and concatenate all of them.
>
> Here is a snippet from a mail to html converter I wrote a few years ago:
>
> def decode_rfc2047(s):
>     if s is None:
>         return None
>     r = ""
>     for chunk in email.header.decode_header(s):
>         if chunk[1]:
>             try:
>                 r += chunk[0].decode(chunk[1])
>             except LookupError:
>                 r += chunk[0].decode("windows-1252")
>             except UnicodeDecodeError:
>                 r += chunk[0].decode("windows-1252")
>         elif type(chunk[0]) == bytes:
>             r += chunk[0].decode('us-ascii')
>         else:
>             r += chunk[0]
>     return r
>
> (this is maybe a bit more forgiving than the OP needs, but I had to deal
> with malformed mails)
>
> I do have to say that Python is extraordinarily clumsy in this regard.
>
> hp
>

Thanks for the reply. In fact, I gave that answer because I did
not understand what the OP wanted to achieve. In addition, the
OP opened a second thread on a similar topic in which I gave a
more correct answer (subject: "What do these '=?utf-8?' sequences
mean in python?", date: "Sat, 6 May 2023 14:50:40 UTC").
I was interested in this thread because a few years ago I wrote a
program in C that emailed the log file of an application whenever
it crashed. I encoded the attachment as base64, but at the time I
did not know about RFC 2047, which covers the subject header. In
addition, investigating the needs of the OP, I discovered that
the MAME is not the only format used to compose the subject. I
found an example in a thread from some days ago where the subject
contained Arabic text (sender: "Uhrda education
<Fatmaelhlwany9@gmail.com>", date: "Wed, 03 May 2023 00:18:14
UTC"). This is the raw version of the subject:

=?UTF-8?B?2LTZh9in2K/YqSDYo9iu2LXYp9im2Yog2K7Yr9mF2Kkg2LnZhdmE2KfYoSDZhdi52KrZhQ==?=
=?UTF-8?B?2K8gI9in2YjZhtmE2KfZitmGINio2LHYs9mI2YUg2YXYrtmB2LbYqSDYrtmE2KfZhCDYtNmH2LEg2YU=?=
=?UTF-8?B?2KfZitmIMjAyMyDZhNmE2KfYs9iq2YHYs9in2LEg2YjYp9iq2LMgLyAwMDIwMTAwOTMwNjExMQ==?=

As you can see, the encoding letter in each encoded word is not
'Q' as in the OP's message but 'B', and the text is encoded as
base64. This made me think that a library could not delegate to
the programmer the burden of managing all these exceptions, so I
investigated further and discovered that the library also
provides a conversion function in addition to the decoding one,
which makes our labors vain:

----------
from email.header import decode_header, make_header

subject = make_header(decode_header(raw_subject))
----------

This line of code correctly converts the subject of the OP's
message and also the one with the Arabic text.
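
A minimal sketch of that claim, using the OP's Q-encoded subject and a made-up base64 subject that is built in the code itself (so it does not depend on the long Arabic example above). Wrapping the result in str() is an addition for this example, to get plain text for printing and comparing.

```python
import base64
from email.header import decode_header, make_header

# Q-encoded subject quoted from the OP's message.
q_subject = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="

# A made-up B-encoded (base64) subject, constructed here from known text.
text = "Marne à la Saône"
b_subject = "=?utf-8?B?" + base64.b64encode(text.encode("utf-8")).decode("ascii") + "?="

# The same one-liner handles both encodings.
decoded_q = str(make_header(decode_header(q_subject)))
decoded_b = str(make_header(decode_header(b_subject)))

print(decoded_q)
print(decoded_b)
```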

I greet you with cordiality.

Re: Problem with accented characters in mailbox.Maildir()

<mailman.41.1683671249.13552.python-list@python.org>

From: hjp-python@hjp.at (Peter J. Holzer)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Wed, 10 May 2023 00:27:20 +0200
 by: Peter J. Holzer - Tue, 9 May 2023 22:27 UTC

On 2023-05-08 23:02:18 +0200, jak wrote:
> Peter J. Holzer wrote:
> > On 2023-05-06 16:27:04 +0200, jak wrote:
> > > Chris Green wrote:
> > > > Chris Green <cl@isbd.net> wrote:
> > > > > A bit more information, msg.get("subject", "unknown") does return a
> > > > > string, as follows:-
> > > > >
> > > > > > Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> > [...]
> > > > ... and of course I now see the issue! The Subject: with utf-8
> > > > characters in it gets spaces changed to underscores. So searching for
> > > > '(Waterways Continental Europe)' fails.
> > > >
> > > > I'll either need to test for both versions of the string or I'll need
> > > > to change underscores to spaces in the Subject: returned by msg.get().
[...]
> > >
> > > subj = email.header.decode_header(raw_subj)[0]
> > >
> > > subj[0].decode(subj[1])
[...]
> > email.header.decode_header returns a *list* of chunks and you have to
> > process and concatenate all of them.
> >
> > Here is a snippet from a mail to html converter I wrote a few years ago:
> >
> > def decode_rfc2047(s):
> >     if s is None:
> >         return None
> >     r = ""
> >     for chunk in email.header.decode_header(s):
[...]
> >                 r += chunk[0].decode(chunk[1])
[...]
> >     return r
[...]
> >
> > I do have to say that Python is extraordinarily clumsy in this regard.
>
> Thanks for the reply. In fact, I gave that answer because I did
> not understand what the OP wanted to achieve. In addition, the
> OP opened a second thread on the similar topic in which I gave a
> more correct answer (subject: "What do these '=?utf-8?' sequences
> mean in python?", date: "Sat, 6 May 2023 14:50:40 UTC").

Right. I saw that after writing my reply. I should have read all
messages, not just that thread before replying.

> the OP, I discovered that the MAME is not the only format used
> to compose the subject.

Not sure what "MAME" is. If it's a typo for MIME, then the base64
variant of RFC 2047 is just as much a part of it as the quoted-printable
variant.

> This made me think that a library could not delegate to the programmer
> the burden of managing all these exceptions,

email.header.decode_header handles both variants, but it produces bytes
sequences which still have to be decoded to get a Python string.

> so I investigated further and discovered that the library also
> provides a conversion function in addition to the decoding one,
> which makes our labors vain:
>
> ----------
> from email.header import decode_header, make_header
>
> subject = make_header(decode_header(raw_subject))
> ----------

Yup. I somehow missed that. That's a lot more convenient than calling
decode in a loop (or generator expression). Depending on what you want
to do with the subject you may have to wrap that in a call to str(), but
it's still a one-liner.
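
A short sketch of the str() point, assuming the goal is the OP's substring search. The generator expression is one possible spelling of the loop variant, and decoding chunks without a charset as "us-ascii" is an assumption made here.

```python
from email.header import Header, decode_header, make_header

raw_subject = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
search_txt = "Waterways Continental Europe"

# make_header() returns a Header object, not a plain string.
header = make_header(decode_header(raw_subject))
is_header = isinstance(header, Header)

# str() turns it into ordinary text that can be searched with "in".
subject = str(header)
matched = search_txt in subject

# The loop variant written as a generator expression.
subject_loop = "".join(
    value.decode(charset or "us-ascii") if isinstance(value, bytes) else value
    for value, charset in decode_header(raw_subject)
)

print(is_header, matched, subject == subject_loop)
```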

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Re: Problem with accented characters in mailbox.Maildir()

<u3f3k8$ih4o$1@dont-email.me>

From: nospam@please.ty (jak)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Wed, 10 May 2023 05:38:47 +0200
 by: jak - Wed, 10 May 2023 03:38 UTC

Peter J. Holzer wrote:
> Not sure what "MAME" is. If it's a typo for MIME, then the base64
> variant of RFC 2047 is just as much a part of it as the quoted-printable
> variant.

MAME is a wonderful collection of ancient video games:
<https://www.consoleroms.com/roms/mame>
Probably a Froidian labpsus directed my fingers on the keyboard ;-D
Sorry, I meant MIME.

Re: Problem with accented characters in mailbox.Maildir()

<u3f48i$ij7m$1@dont-email.me>

From: nospam@please.ty (jak)
Newsgroups: comp.lang.python
Subject: Re: Problem with accented characters in mailbox.Maildir()
Date: Wed, 10 May 2023 05:49:38 +0200
 by: jak - Wed, 10 May 2023 03:49 UTC

jak wrote:
> Froidian labpsus

* Freudian lapsus *
(my fault that I have not checked but my compliments to the google
translator)

