Re: Mutating an HTML file with BeautifulSoup

From: __peter__@web.de (Peter Otten)
Newsgroups: comp.lang.python
Subject: Re: Mutating an HTML file with BeautifulSoup
Date: Mon, 22 Aug 2022 08:30:00 +0200
 by: Peter Otten - Mon, 22 Aug 2022 06:30 UTC

On 22/08/2022 05:30, Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 10:04, Buck Evan <buck.2019@gmail.com> wrote:
>>
>> I've had much success doing round trips through the lxml.html parser.
>>
>> https://lxml.de/lxmlhtml.html
>>
>> I ditched bs for lxml long ago and never regretted it.
>>
>> If you find that you have a bunch of invalid html that lxml inadvertently "fixes", I would recommend adding a stutter-step to your project: perform a noop roundtrip thru lxml on all files. I'd then analyze any diff by progressively excluding changes via `grep -vP`.
>> Unless I'm mistaken, all such changes should fall into no more than a dozen groups.
>>
>
> Will this round-trip mutate every single file and reorder the tag
> attributes? Because I really don't want to manually eyeball all those
> changes.

Most certainly not. Reordering is a bs4 feature that is governed by a
formatter. You can easily prevent attributes from being reordered:

>>> import bs4
>>> soup = bs4.BeautifulSoup("""<div beta="1" alpha="2"/>""")
>>> soup
<html><body><div alpha="2" beta="1"></div></body></html>
>>> class Formatter(bs4.formatter.HTMLFormatter):
...     def attributes(self, tag):
...         return [] if tag.attrs is None else list(tag.attrs.items())
...

>>> soup.decode(formatter=Formatter())
'<html><body><div beta="1" alpha="2"></div></body></html>'
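
The same idea as a standalone script might look like the sketch below. The class name SourceOrderFormatter is made up for illustration, and the parser is named explicitly as "html.parser" (an assumption, chosen so the script does not depend on lxml), which is why the output has no <html><body> wrapper.

```python
import bs4
from bs4.formatter import HTMLFormatter


class SourceOrderFormatter(HTMLFormatter):
    """Hypothetical name: emit attributes in parse order instead of sorted."""

    def attributes(self, tag):
        # tag.attrs is a dict, so it keeps the order in which the parser
        # inserted the attributes.
        return [] if tag.attrs is None else list(tag.attrs.items())


# Naming the parser explicitly is an assumption, not part of the session above.
soup = bs4.BeautifulSoup('<div beta="1" alpha="2"></div>', "html.parser")

sorted_html = soup.decode()  # default formatter sorts attributes by name
preserved_html = soup.decode(formatter=SourceOrderFormatter())

print(sorted_html)
print(preserved_html)
```

Comparing the two printed lines shows the effect of the formatter on one and the same parsed tree.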

Blank space is probably removed by the underlying html parser.
It might be possible to make bs4 instantiate the lxml.html.HTMLParser
with remove_blank_text=False, but I didn't try hard enough ;)
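
One way to see what the parser itself does, leaving bs4 out of the picture, is to round-trip a fragment through lxml.html and compare the result with the input. This is only a sketch: the sample markup is invented, and whether whitespace survives may depend on the lxml and libxml2 versions installed.

```python
import lxml.html

# Invented sample: unsorted attributes plus whitespace between elements.
SOURCE = '<div beta="1" alpha="2"><p>one</p>\n\n  <p>two</p></div>'

# remove_blank_text is passed explicitly here to make the setting visible.
parser = lxml.html.HTMLParser(remove_blank_text=False)
root = lxml.html.fromstring(SOURCE, parser=parser)
round_tripped = lxml.html.tostring(root, encoding="unicode")

print(round_tripped)
print("unchanged" if round_tripped == SOURCE else "changed")
```

If the second line says "changed", diffing the two strings shows which normalisations the parser or serializer applied.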

That said, for my humble html scraping needs I have ditched bs4 in favor
of lxml and its xpath capabilities.
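
For readers who have not used it, a minimal scraping sketch with lxml and XPath could look like this. The markup and the class name "ext" are invented for the example.

```python
import lxml.html

# Invented sample markup for illustration only.
HTML = """
<div>
  <a class="ext" href="https://lxml.de/lxmlhtml.html">lxml.html</a>
  <a href="/local">local</a>
</div>
"""

root = lxml.html.fromstring(HTML)

# Attribute and text() selections come back as lists of strings.
hrefs = root.xpath(".//a/@href")
ext_labels = root.xpath(".//a[@class='ext']/text()")

print(hrefs)
print(ext_labels)
```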

