Re: Mutating an HTML file with BeautifulSoup

From: rosuav@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: Mutating an HTML file with BeautifulSoup
Date: Sat, 20 Aug 2022 07:01:09 +1000
Lines: 42
Message-ID: <mailman.304.1660942882.20444.python-list@python.org>
References: <CAPTjJmoFiJ4V-sfye5OU04=hpRRpWQ_nX0=C+RVQ4QBu5X80PA@mail.gmail.com>
<04D12A76-92D8-4584-AE6E-AD3072E438EE@barrys-emacs.org>
<CAPTjJmqKz1om04YVgaOgt9gtrqsGbU93s1OK1t6YhtTyLvF=ig@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <04D12A76-92D8-4584-AE6E-AD3072E438EE@barrys-emacs.org>
 by: Chris Angelico - Fri, 19 Aug 2022 21:01 UTC

On Sat, 20 Aug 2022 at 05:12, Barry <barry@barrys-emacs.org> wrote:
>
>
>
> > On 19 Aug 2022, at 19:33, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> I recall that bs4 parses into an object tree and loses the detail of the input.
> I recently ported from a very old bs to bs4 and hit the same issue.
> So no, it will not output the same as what went in.
>
> If you can trust the input to be parsed as XML, meaning all the rules of closing
> tags have been followed, then I think you can parse and unparse through XML to
> do what you want.
>

Yeah, no, I can't; this is HTML 4 with a ton of inconsistencies. Oh
well. Thanks for trying, anyhow.

So I'm left with a few options:

1) Give up on validation, give up on verification, and just run this
thing on the production site with my fingers crossed
2) Instead of doing an intelligent reconstruction, just str.replace()
one URL with another within the file
3) Split the file into lines, find the Nth line (elem.sourceline) and
str.replace that line only
4) Attempt to use elem.sourceline and elem.sourcepos to find the start
of the tag, manually find the end, and replace one tag with the
reconstructed form.
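
For what it's worth, options 3 and 4 could be sketched roughly as below.
This is only a sketch under assumptions, not a tested tool: it assumes the
html.parser convention that elem.sourceline is 1-based and elem.sourcepos
is the 0-based column of the tag's opening "<" (other parsers may report
positions differently), and that the file was read with newline="" so the
text matches what the parser saw. The helper names (line_col_to_offset,
find_tag_end, replace_in_line, replace_tag) and the example URLs are
hypothetical.

```python
def line_col_to_offset(text, sourceline, sourcepos):
    """Turn a 1-based line number and 0-based column into a string offset."""
    offset = 0
    for _ in range(sourceline - 1):
        # Skip past one "\n" per preceding line.
        offset = text.index("\n", offset) + 1
    return offset + sourcepos


def find_tag_end(text, start):
    """Return the index just past the '>' that closes the tag at `start`.

    A '>' inside a quoted attribute value is ignored. Comments, CDATA
    and stray quotes are not handled; this is deliberately naive.
    """
    quote = None
    for i in range(start, len(text)):
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            return i + 1
    raise ValueError("tag starting at offset %d is never closed" % start)


def replace_in_line(text, sourceline, old, new):
    """Option 3: str.replace() restricted to one line of the file."""
    start = line_col_to_offset(text, sourceline, 0)
    end = text.find("\n", start)
    if end == -1:
        end = len(text)
    return text[:start] + text[start:end].replace(old, new) + text[end:]


def replace_tag(text, sourceline, sourcepos, new_tag):
    """Option 4: swap one start tag for a reconstructed one."""
    start = line_col_to_offset(text, sourceline, sourcepos)
    if text[start] != "<":
        raise ValueError("position does not point at the start of a tag")
    end = find_tag_end(text, start)
    return text[:start] + new_tag + text[end:]


# In real use the position would come from the parsed element, e.g.
# elem.sourceline and elem.sourcepos; here it is written out by hand.
html = '<P>See\n<A HREF="http://old.example/">this</A> page\n'
fixed = replace_tag(html, 2, 0, '<A HREF="https://new.example/">')
print(fixed)
```

Everything outside the replaced span is copied through untouched, so the
rest of the file stays byte-for-byte as it was, inconsistencies and all.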

I'm inclined to the first option, honestly. The others just seem like
hard work, and I became a programmer so I could be lazy...

ChrisA

