RetroBBS - comp.lang.python - Re: HTML extraction

Re: HTML extraction

<mailman.37.1638914237.15287.python-list@python.org>

https://www.rocksolidbbs.com/devel/article-flat.php?id=20747&group=comp.lang.python#20747

From: rosuav@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: HTML extraction
Date: Wed, 8 Dec 2021 08:57:06 +1100
Lines: 69
Message-ID: <mailman.37.1638914237.15287.python-list@python.org>
References: <CAEsMKX3TkUK==fNcZVZXhDrEWFA8RW6PTY47quACz7LmJ-Xy_Q@mail.gmail.com>
<CAPTjJmp7F5M-R+yG763x8uEDFoVD_rUomDDHv8hsXFqUun20uA@mail.gmail.com>
<CALk2KRVGa-0DEwivNJbnDsLuyg8rJjAB=Vo5ZAW-J0FdswKuFQ@mail.gmail.com>
<CAPTjJmrXc-pLX+ttk=2J3582dvMizrMAmDBo_H7TbeK68ah8Zw@mail.gmail.com>

by: Chris Angelico - Tue, 7 Dec 2021 21:57 UTC

On Wed, Dec 8, 2021 at 7:55 AM Roland Mueller
<roland.em0001@googlemail.com> wrote:
>
> Hello,
>
> On Tue, 7 Dec 2021 at 20:08, Chris Angelico (rosuav@gmail.com) wrote:
>>
>> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
>> <juliushamilton100@gmail.com> wrote:
>> >
>> > Hey,
>> >
>> > Could anyone please comment on the purest way simply to strip HTML tags
>> > from the internal text they surround?
>> >
>> > I know Beautiful Soup is a convenient tool, but I’m interested to know what
>> > the most minimal way to do it would be.
>>
>> That's definitely the best and most general way, and would still be my
>> first thought most of the time.
>>
>> > People say you usually don’t use Regex for a second order language like
>> > HTML, so I was thinking about using xpath or lxml, which seem like very
>> > pure, universal tools for the job.
>> >
>> > I did find an example for doing this with the re module, though.
>> >
>> > Would it be fair to say that to just strip the tags, Regex is fine, but you
>> > need to build a tree-like object if you want the ability to select which
>> > nodes to keep and which to discard?
>>
>> Obligatory reference:
>>
>> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>>
>> > Can xpath / lxml do that?
>> >
>> > What are the chief differences between xpath / lxml and Beautiful Soup?
>> >
>>
>> I've never directly used lxml, mainly because bs4 offers all the same
>> advantages and more, with about the same costs. However, if you're
>> looking for a no-external-deps option, Python *does* include an HTML
>> parser in the standard library:
>>
>
> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML that is not well-formed XML, such as the following fragment?
>
> <p>A
> <p>B
> <hr>
>
> BR,
> Roland
>

Check out the bs4 docs for some of the things you can do with it :)
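
For anyone following the thread who wants the no-external-deps route mentioned above, here is a minimal sketch using the standard library's html.parser module. It only shows that a lenient HTML parser accepts Roland's fragment even though the <p> and <hr> tags are never closed. The names TextExtractor and strip_tags are made up for this illustration and are not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore the tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are
        # reported through other handlers, which we leave as no-ops.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Roland's fragment: valid HTML, but not well-formed XML.
fragment = "<p>A\n<p>B\n<hr>\n"
print(strip_tags(fragment).split())
```

Note that this sketch keeps the contents of <script> and <style> elements as text, so real pages would need extra filtering. Beautiful Soup is likewise a parser for HTML and XML and has nothing to do with SOAP; its get_text() method covers the same ground with more convenience.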

ChrisA

I do not fear computers. I fear the lack of them. -- Isaac Asimov
