
Subject                   Author
* tdom html mode          saitology9
+* Re: tdom html mode     ted@loft.tnolan.com (Ted Nolan)
|`* Re: tdom html mode    saitology9
| `* Re: tdom html mode   ted@loft.tnolan.com (Ted Nolan)
|  `- Re: tdom html mode  saitology9
`* Re: tdom html mode     Rolf Ade
 `- Re: tdom html mode    saitology9

tdom html mode

https://www.rocksolidbbs.com/devel/article-flat.php?id=21891&group=comp.lang.tcl#21891

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: saitology9@gmail.com (saitology9)
Newsgroups: comp.lang.tcl
Subject: tdom html mode
Date: Tue, 25 Apr 2023 12:31:08 -0400
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <u28v8d$ueo3$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 25 Apr 2023 16:31:09 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="1aa181b031904aa3c3c78178b44de1a7";
logging-data="998147"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+qaIHEXz26s1pTNY4jIWd0"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.9.0
Cancel-Lock: sha1:ebEAwJDvmiT7uALHd11d9KBc3jQ=
Content-Language: en-US
 by: saitology9 - Tue, 25 Apr 2023 16:31 UTC

It seems like tdom html parsing doesn't work well with partial html
strings that don't necessarily include the full doctype/head/body/etc.
tags. tdom seems to return nodes only for the first tag and not the
rest; meaning that if there are two "<p>" tags in sequence for example,
it processes only the first one.

That is fine if this is the expected behavior but if not, what is the
correct way to do this?

Re: tdom html mode

https://www.rocksolidbbs.com/devel/article-flat.php?id=21892&group=comp.lang.tcl#21892

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.szaf.org!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: ted@loft.tnolan.com (Ted Nolan <tednolan>)
Newsgroups: comp.lang.tcl
Subject: Re: tdom html mode
Date: 25 Apr 2023 17:34:39 GMT
Organization: loft
Lines: 16
Message-ID: <kaqh9fFrlpaU1@mid.individual.net>
References: <u28v8d$ueo3$1@dont-email.me>
X-Trace: individual.net DALXG3Tt2TM7zgquhxdEVQcK0+mFe4vMekuQ2ZAcP359og+oEi
X-Orig-Path: not-for-mail
Cancel-Lock: sha1:QLAXUmkgH2HgfU/OolYjNgYI7H8=
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
 by: ted@loft.tnolan.com - Tue, 25 Apr 2023 17:34 UTC

In article <u28v8d$ueo3$1@dont-email.me>,
saitology9 <saitology9@gmail.com> wrote:
>It seems like tdom html parsing doesn't work well with partial html
>strings that don't necessarily include the full doctype/head/body/etc.
>tags. tdom seems to return nodes only for the first tag and not the
>rest; meaning that if there are two "<p>" tags in sequence for example,
>it processes only the first one.
>
>That is fine if this is the expected behavior but if not, what is the
>correct way to do this?

I find that I always have better results with tdom parsing if I use the
"-html5" option. Are you using that?
--
columbiaclosings.com
What's not in Columbia anymore..

Re: tdom html mode

https://www.rocksolidbbs.com/devel/article-flat.php?id=21894&group=comp.lang.tcl#21894

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: saitology9@gmail.com (saitology9)
Newsgroups: comp.lang.tcl
Subject: Re: tdom html mode
Date: Tue, 25 Apr 2023 14:40:35 -0400
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <u296r4$vqbb$1@dont-email.me>
References: <u28v8d$ueo3$1@dont-email.me> <kaqh9fFrlpaU1@mid.individual.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 25 Apr 2023 18:40:36 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7650db85cbda4689532e286ece70e41a";
logging-data="1042795"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18k8FnAIo04lBUW1WCCSYhN"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.9.0
Cancel-Lock: sha1:y4e/P78rfiQybKx0KVbp97uoS2M=
Content-Language: en-US
In-Reply-To: <kaqh9fFrlpaU1@mid.individual.net>
 by: saitology9 - Tue, 25 Apr 2023 18:40 UTC

On 4/25/2023 1:34 PM, Ted Nolan <tednolan> wrote:
>
> I find that I always have better results with tdom parsing if I use the
> "-html5" option. Are you using that?

No, I am not. However, my build doesn't recognize this option. I just
reviewed the tdom docs and there wasn't any mention of this option.

For reference, this is what I have:

% package req tdom
0.9.1

% dom parse -html "<p>hello</p> <p>there</p>"
domDoc010BC518

Re: tdom html mode

https://www.rocksolidbbs.com/devel/article-flat.php?id=21897&group=comp.lang.tcl#21897

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: ted@loft.tnolan.com (Ted Nolan <tednolan>)
Newsgroups: comp.lang.tcl
Subject: Re: tdom html mode
Date: 25 Apr 2023 19:25:10 GMT
Organization: loft
Lines: 46
Message-ID: <kaqnomFsir8U1@mid.individual.net>
References: <u28v8d$ueo3$1@dont-email.me> <kaqh9fFrlpaU1@mid.individual.net> <u296r4$vqbb$1@dont-email.me>
X-Trace: individual.net A8GwUol8geduc1c5/pHWfwo1IuCHaRH2w3vFGI+bbDMjVHBZ7U
X-Orig-Path: not-for-mail
Cancel-Lock: sha1:fguhEThtva+fzlqVJgafGkTVJwc=
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
 by: ted@loft.tnolan.com - Tue, 25 Apr 2023 19:25 UTC

In article <u296r4$vqbb$1@dont-email.me>,
saitology9 <saitology9@gmail.com> wrote:
>On 4/25/2023 1:34 PM, Ted Nolan <tednolan> wrote:
>>
>> I find that I always have better results with tdom parsing if I use the
>> "-html5" option. Are you using that?
>
>No, I am not. However, my build doesn't recognize this option. I just
>reviewed the tdom docs and there wasn't any mention of this option.
>
>For reference, this is what I have:
>
>% package req tdom
>0.9.1
>
>% dom parse -html "<p>hello</p> <p>there</p>"
>domDoc010BC518
>
>
>

It's a compile option:

http://www.tdom.org/index.html/doc/trunk/doc/dom.html

-html5
This option is only available if tDOM was built with
--enable-html5. Try the featureinfo method if you need
to know if this feature is built in.

Mine (FreeBSD) has it:

===
ted@hotrod:~ % tclsh8.6
% package require tdom
0.9.1
% dom parse -html5 "<p>hello</p> <p>there</p>"
domDoc0x80097d140
===

That's not to say it would solve your problem, but as I say
I've had better luck with it.
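
A minimal sketch of the featureinfo check mentioned above, assuming the
feature name is "html5" as listed in the tDOM docs. The proc name parseHtml
is made up for illustration. It uses -html5 when the build supports it and
falls back to -html otherwise:

```tcl
package require tdom

# Hypothetical helper (not part of tDOM): pick the HTML parser by build feature.
proc parseHtml {html} {
    # dom featureinfo html5 is true only if tDOM was built with --enable-html5
    if {[dom featureinfo html5]} {
        return [dom parse -html5 $html]
    }
    # Fall back to the built-in, more lenient -html parser
    return [dom parse -html $html]
}

set doc [parseHtml "<p>hello</p> <p>there</p>"]
puts [$doc asHTML]
$doc delete
```

The resulting tree can differ between the two parsers, so it is worth
printing the result rather than assuming its shape.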
--
columbiaclosings.com
What's not in Columbia anymore..

Re: tdom html mode

https://www.rocksolidbbs.com/devel/article-flat.php?id=21898&group=comp.lang.tcl#21898

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: saitology9@gmail.com (saitology9)
Newsgroups: comp.lang.tcl
Subject: Re: tdom html mode
Date: Tue, 25 Apr 2023 15:54:07 -0400
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <u29b50$10gbl$1@dont-email.me>
References: <u28v8d$ueo3$1@dont-email.me> <kaqh9fFrlpaU1@mid.individual.net>
<u296r4$vqbb$1@dont-email.me> <kaqnomFsir8U1@mid.individual.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 25 Apr 2023 19:54:08 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="50d2763eeb3f5eaea12077ef55041609";
logging-data="1065333"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19kQhZLKXK0UgdDwrPQWfOU"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.9.0
Cancel-Lock: sha1:5d+9bjVbdJn3YWd2e5sc3WDEGkE=
Content-Language: en-US
In-Reply-To: <kaqnomFsir8U1@mid.individual.net>
 by: saitology9 - Tue, 25 Apr 2023 19:54 UTC

On 4/25/2023 3:25 PM, Ted Nolan <tednolan> wrote:
>
> It's a compile option:
>

Thank you very much for your help. My version is not built with this
option. At the moment, it is not worth the trouble pursuing this any
further but it is good to know the option exists.

Re: tdom html mode

https://www.rocksolidbbs.com/devel/article-flat.php?id=21899&group=comp.lang.tcl#21899

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.szaf.org!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: rolf@pointsman.de (Rolf Ade)
Newsgroups: comp.lang.tcl
Subject: Re: tdom html mode
Date: Tue, 25 Apr 2023 23:17:12 +0200
Organization: Me
Lines: 54
Message-ID: <87wn1zmzl3.fsf@pointsman.de>
References: <u28v8d$ueo3$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: individual.net 55E+W2f0hNKOdqz4OcrZNgf0nLj9vlbZxbuKgRgmzxtmC3cBY=
Cancel-Lock: sha1:H2ScaMB8SLTyWsvcI54Be5SWhGA= sha1:5DZOE4ksyYXGnHodcMt8SpVZRRw=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
 by: Rolf Ade - Tue, 25 Apr 2023 21:17 UTC

saitology9 <saitology9@gmail.com> writes:
> It seems like tdom html parsing doesn't work well with partial html
> strings that don't necessarily include the full doctype/head/body/etc.
> tags. tdom seems to return nodes only for the first tag and not the
> rest; meaning that if there are two "<p>" tags in sequence for
> example, it processes only the first one.
>
> That is fine if this is the expected behavior but if not, what is the
> correct way to do this?

You'd better update to the current tdom 0.9.3 (which provides a solution
to your question).

While the -html5 parser (if it is built in; that requires the gumbo
HTML5 parser lib present at build time and the configure switch
--enable-html5) is very robust (it digests nearly any tag soup), it may
not be the right tool for this problem, because it always inserts a
single document root and adds missing elements implied by the context
(such as <head>, <tbody>, etc.).

You want to parse an HTML fragment like

"<p>hello</p> <p>there</p>"

But what DOM tree do you expect to get from that? That document or
fragment doesn't have a single root, as HTML and XML documents must. So
if you are fine with getting a DOM _forest_ instead of a DOM tree, just do:

package require tdom 0.9.3
dom parse -html -forest "<p>hello</p> <p>there</p>" doc
$doc asXML

This script returns this to me:

<p>hello</p>
<p>there</p>

tDOM's dom methods (and the XPath engine) work fine, and in a natural
way, with such a "forest". It is just that you no longer have the pattern

set root [$doc documentElement]

with all of your data as descendants of that one root (remember,
you have a forest, not a tree).

The "other" root nodes beside the one you still get from [$doc
documentElement] are (next) siblings of that one. Or you can get all
the roots of your forest with [$doc childNodes]. I hope these hints get
you started.
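
The two ways of reaching the roots described above can be sketched as
follows. This is only an illustration, assuming tdom 0.9.3 or later for
-forest. The variable names are arbitrary, and non-element nodes are
filtered out in case any appear at the top level:

```tcl
package require tdom 0.9.3

set doc [dom parse -html -forest "<p>hello</p> <p>there</p>"]

# Way 1: ask the document for all of its top-level nodes at once.
set roots {}
foreach node [$doc childNodes] {
    if {[$node nodeType] eq "ELEMENT_NODE"} {
        lappend roots $node
    }
}

# Way 2: start at documentElement and follow the nextSibling chain.
set viaSiblings {}
set node [$doc documentElement]
while {$node ne ""} {
    if {[$node nodeType] eq "ELEMENT_NODE"} {
        lappend viaSiblings $node
    }
    set node [$node nextSibling]
}

# Print the tag name and text content of each root.
foreach root $roots {
    puts "[$root nodeName]: [$root text]"
}
$doc delete
```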

rolf

Re: tdom html mode

https://www.rocksolidbbs.com/devel/article-flat.php?id=21901&group=comp.lang.tcl#21901

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: saitology9@gmail.com (saitology9)
Newsgroups: comp.lang.tcl
Subject: Re: tdom html mode
Date: Tue, 25 Apr 2023 18:37:17 -0400
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <u29kmu$11tko$1@dont-email.me>
References: <u28v8d$ueo3$1@dont-email.me> <87wn1zmzl3.fsf@pointsman.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 25 Apr 2023 22:37:19 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0aea4b39ed759f3277fa6b7f0faf2fb8";
logging-data="1111704"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+AI6XeDY7Ajpy0TJIG+re7"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.9.0
Cancel-Lock: sha1:+8qqrraxXpdqbXkQ/7CbMDEWhdw=
Content-Language: en-US
In-Reply-To: <87wn1zmzl3.fsf@pointsman.de>
 by: saitology9 - Tue, 25 Apr 2023 22:37 UTC

On 4/25/2023 5:17 PM, Rolf Ade wrote:
>
> But what DOM tree do you expect to get from that? That document or
> fragment doesn't have a single root as HTML or XML have to. So if you
> are fine with getting a DOM _forest_ instead of a DOM tree, just do:
>
> package require tdom 0.9.3
> dom parse -html -forest "<p>hello</p> <p>there</p>" doc
> $doc asXML
>

Dear Rolf, thank you. Yes, I wanted the "forest" option. I am aware of
the difference between a tree and a forest. I have two versions of tdom
(0.9.1 and 0.9.2) and they both return a single node for the plain parse
command. So despite me writing a recursive function to navigate the
node's children as well as its siblings, I was not getting the full data
out. In any case, this was more of a curiosity on my part and not based
on any need.

I will look to upgrade tdom soon. Thanks for the heads up.
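
For reference, a recursive walk of the kind described above might look like
this once -forest is available (tdom 0.9.3 or later). This is a sketch, not
the function from the post: the proc name outline and the sample fragment
are made up for illustration.

```tcl
package require tdom 0.9.3

# Hypothetical helper: return an indented outline of a node and its descendants.
proc outline {node {depth 0}} {
    set indent [string repeat "  " $depth]
    set lines {}
    switch -- [$node nodeType] {
        ELEMENT_NODE {
            lappend lines "$indent<[$node nodeName]>"
            # Recurse into the children, one level deeper
            foreach child [$node childNodes] {
                lappend lines {*}[outline $child [expr {$depth + 1}]]
            }
        }
        TEXT_NODE {
            lappend lines "$indent\"[$node nodeValue]\""
        }
    }
    return $lines
}

set doc [dom parse -html -forest "<p>hello <b>big</b></p> <p>there</p>"]

# Visit every root of the forest, not just documentElement.
set lines {}
foreach root [$doc childNodes] {
    lappend lines {*}[outline $root]
}
puts [join $lines \n]
$doc delete
```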

