[Flame] Lamentatio : the LbO XML/HTML invisible rubbish or garbage under the hood

<ufbi58$1gvb7$1@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=13750&group=comp.os.linux.misc#13750

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: NoliMihiFrangereMentulam@libero.it (MarioCCCP)
Newsgroups: comp.os.linux.misc
Subject: [Flame] Lamentatio : the LbO XML/HTML invisible rubbish or garbage under the hood
Date: Sun, 1 Oct 2023 12:37:28 +0200
Organization: A noiseless patient Spider
Lines: 74
Message-ID: <ufbi58$1gvb7$1@dont-email.me>
Reply-To: MarioCCCP@CCCP.MIR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 1 Oct 2023 10:37:29 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="4e3e8cc0264650430fd7c9dd67753cfd";
logging-data="1604967"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18IaPoy2FLCEQoS2M1sFA7V"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:n5Txny9mdJo02e4bOwcpaF8f3Co=
Content-Language: en-GB, it-IT
 by: MarioCCCP - Sun, 1 Oct 2023 10:37 UTC

I have had to look closely inside the uncompressed .fodt
and .html files generated by LibreOffice ... and they contain
pure garbage, at least for old, heavily edited, long documents.
Basically, the system seems designed on the assumption that
you write a book in a single shot, without any editing at
all, if you want clean underlying code.

You can find tons of rubbish like this:

<p>Lorem<i>Ipsum</i><i>arma</i>virumque</p>

<p><em>Lorem</em><em> </em>Ipsum</p>

and every conceivable mess of useless tags (italic
formatting applied to just a blank space), closing and
immediately reopening the same kind of tag, selecting the
same font face, font family and size several times over for
adjacent chunks of text, or even for the same chunk.
Some shit like <span {text-transform: uppercase}>LOreM
IPSum</span>.
What the heck are they doing ???
Either do the uppercasing entirely in software (rendered
dynamically at "view-time") or store actual uppercase
characters THROUGHOUT, but not a mix.
And what about formatting applied only to a blank space and
not to the text on its right or left ? A bold blank ? An
italic blank amid unformatted text ? It is an evident
ARTIFACT of manual editing and sloppy selection. Such
meaningless formatting is a "relic" of past edits.
It seems the editor is unaware of SEMANTICS.
Sometimes people are untidy when they write. Sometimes they
type everything in uppercase by hand, other times they select
some text and use the formatting button. Both situations
should be treated exactly the same way and produce clean
content under the hood, not shit like mixed case. I know,
this mess is invisible after rendering, but e.g. Sigil
complains about invalid XML and refuses to import documents.
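
To make the "italic blank" complaint concrete, here is a minimal
sketch of the kind of cleanup meant above. It is not anything
LibreOffice does, and the names unwrap_blank_inline and INLINE are
made up for illustration. It assumes well-formed, namespace-free
markup like the snippets in this post (a real .fodt uses namespaced
elements and named styles), and it deliberately skips <span>, since
an underlined or highlighted blank is actually visible.

```python
import xml.etree.ElementTree as ET

# Assumption: these tags change nothing visible when they wrap only blanks.
INLINE = {"i", "em", "b", "strong"}

def unwrap_blank_inline(parent):
    """Replace inline children that hold only whitespace with that whitespace."""
    prev = None  # last sibling that was kept
    for child in list(parent):
        unwrap_blank_inline(child)  # clean nested content first
        is_blank = (child.tag in INLINE and len(child) == 0
                    and (child.text or "").strip() == "")
        if not is_blank:
            prev = child
            continue
        # Keep the whitespace itself plus whatever text followed the tag.
        kept = (child.text or "") + (child.tail or "")
        if prev is None:
            parent.text = (parent.text or "") + kept
        else:
            prev.tail = (prev.tail or "") + kept
        parent.remove(child)

root = ET.fromstring("<p><em>Lorem</em><em> </em>Ipsum</p>")
unwrap_blank_inline(root)
print(ET.tostring(root, encoding="unicode"))
# prints: <p><em>Lorem</em> Ipsum</p>
```

The blank itself is kept as plain text, so the rendered result should
not change; only the pointless tag goes away.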

And why not "collapse" adjacent identical tags with no
content between them ? Closing and reopening the same tag is
crazy and produces annoyingly complex, unreadable XML.
I used to be a genuine fan of LibreOffice ... well, now I am
slowly coming round to the view that the ODF formats are
stupid, in the sense that they are not semantically clean or
consistent; they are also redundant, and EPUB publishing
software turns up its nose at them.
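
The "collapse" idea could look roughly like the sketch below. It
is only an illustration under stated assumptions: merge_adjacent
and INLINE are hypothetical names, the input is well-formed and
namespace-free, and only childless inline elements with the same
tag and the same attributes are merged, and only when there is
literally nothing (not even a space) between them.

```python
import xml.etree.ElementTree as ET

# Assumption: only these inline tags are safe to merge; block tags such
# as <p> must never be merged, so they are left out on purpose.
INLINE = {"i", "em", "b", "strong", "span"}

def merge_adjacent(parent):
    """Merge runs of identical, directly adjacent inline leaf elements."""
    prev = None  # previous sibling that is still in the tree
    for child in list(parent):
        mergeable = (prev is not None
                     and child.tag in INLINE
                     and child.tag == prev.tag
                     and child.attrib == prev.attrib
                     and not prev.tail                 # nothing between the two tags
                     and len(prev) == 0 and len(child) == 0)
        if mergeable:
            prev.text = (prev.text or "") + (child.text or "")
            prev.tail = child.tail                     # text after the second tag
            parent.remove(child)
        else:
            merge_adjacent(child)
            prev = child

root = ET.fromstring("<p>Lorem<i>Ipsum</i><i>arma</i>virumque</p>")
merge_adjacent(root)
print(ET.tostring(root, encoding="unicode"))
# prints: <p>Lorem<i>Ipsumarma</i>virumque</p>
```

Requiring identical attributes matters for <span>: two spans with
different styles sitting next to each other are left untouched.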

Now, I've come to the conclusion that, at least as a
TEMPORARY STATE, this rubbish could have a justification : it
makes it easier to "journal" edits and so to support
extensive UNDO/REDO, by leaving the mess as it is so that it
can be backtracked when needed. Otherwise a very heavy
heuristic or heavier journalling, not to mention "snapshots"
of whole states, would be necessary.

Nevertheless I find it unacceptable that a proper CLEANUP
isn't even attempted ON SAVING. I don't mean autosave (which
rightly should be transparent to the user and have no side
effects), but a manual commit of all the edits made. When
the user saves the current state, imho the undo/redo history
is unnecessary, so cleaning up the underlying code would be
"legit".

Now I am desperately looking for some post-mortem utilities
able to clean up HTML / XML :( :(

Apologies for the outburst .... I am going crazy analysing
the code, both manually with regexes and programmatically,
without getting anything done :(

--
1) Resist, resist, resist.
2) If everyone pays their taxes, the taxes get paid by everyone
MarioCPPP

Re: [Flame] Lamentatio : the LbO XML/HTML invisible rubbish or garbage under the hood

<877co6rwov.fsf@usenet.ankman.de>

https://www.rocksolidbbs.com/computers/article-flat.php?id=13754&group=comp.os.linux.misc#13754

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ank@spamfence.net (Andreas Kohlbach)
Newsgroups: comp.os.linux.misc
Subject: Re: [Flame] Lamentatio : the LbO XML/HTML invisible rubbish or garbage under the hood
Date: Sun, 01 Oct 2023 15:04:32 -0400
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <877co6rwov.fsf@usenet.ankman.de>
References: <ufbi58$1gvb7$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="0f2b2dacd4e0285c0ba736b01a7b3a90";
logging-data="2593243"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Rj3ttU7kpU6LrYRBikrjP"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:W56rRy5OaK0qasPNZUZU9aOBRA0=
sha1:5K70v7Z1xS+Anl0bSYDcecZKR3k=
 by: Andreas Kohlbach - Sun, 1 Oct 2023 19:04 UTC

On Sun, 1 Oct 2023 12:37:28 +0200, MarioCCCP wrote:
>
> I am having to closely look inside uncompressed files .fodt or .html
> generated by LibreOffice ... and they contain pure garbage, at least
> in aged, heavily edited, long documents.
> Substantially, the system seem designed to write a book in a single
> shot without any editing at all, to produce clean underlying code.
>
> you can find tons of rubbish like this
>
> <p>Lorem<i>Ipsum</i><i>arma</i>virumque</p>
>
> <p><em>Lorem</em><em> </em>Ipsum</p>
>
> and every possible conceivable mess of useless tags (formatting in
> italic just a blank space), closing and immediately reopening the same
> kind of tag, multiply selecting the same fontface, font family, size,
> for adjacent or even the same chunk of text.
> Some shit like <span {text-transform: uppercase}>LOreM IPSum</span>.
> What the heck are they doing ???
> Either use a totally SW uppercasing (rendered dynamically at
> "view-time") or store WHOLLY actual uppercased characters, but not a
> mix.

That was a lot to read, and I didn't read it all. But I suggest giving
"tidy" a shot, a tool that attempts to clean up HTML.

https://linux.die.net/man/1/tidy

| Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For
| HTML variants, it detects and corrects many common coding errors and
| strives to produce visually equivalent markup that is both W3C
| compliant and works on most browsers. A common use of Tidy is to
| convert plain HTML to XHTML. For generic XML files, Tidy is limited to
| correcting basic well-formedness errors and pretty printing.
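
If tidy is going to be run over many exported files, a small wrapper
along these lines may help. It is a sketch, not a tested recipe: the
file names are hypothetical, the helper names tidy_command and
run_tidy are made up, and the options used (-q, -asxhtml, -utf8, -o)
should be checked against the man page of the installed version.

```python
import shutil
import subprocess

def tidy_command(src, dst):
    """Build a tidy call that converts src to XHTML and writes it to dst."""
    return ["tidy", "-q", "-asxhtml", "-utf8", "-o", dst, src]

def run_tidy(src, dst):
    """Run tidy and return (exit status, messages printed on stderr)."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy is not installed or not on PATH")
    # tidy exits with 1 on warnings and 2 on errors, so a non-zero status
    # is expected for messy input; that is why check=True is not used.
    result = subprocess.run(tidy_command(src, dst),
                            capture_output=True, text=True)
    return result.returncode, result.stderr

# Hypothetical usage:
# status, messages = run_tidy("export.html", "export.xhtml")
```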
--
Andreas

Re: [Flame] Lamentatio : the LbO XML/HTML invisible rubbish or garbage under the hood

<ufcn7f$2gkab$2@dont-email.me>

https://www.rocksolidbbs.com/computers/article-flat.php?id=13758&group=comp.os.linux.misc#13758

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: NoliMihiFrangereMentulam@libero.it (MarioCCCP)
Newsgroups: comp.os.linux.misc
Subject: Re: [Flame] Lamentatio : the LbO XML/HTML invisible rubbish or garbage under the hood
Date: Sun, 1 Oct 2023 23:10:07 +0200
Organization: A noiseless patient Spider
Lines: 46
Message-ID: <ufcn7f$2gkab$2@dont-email.me>
References: <ufbi58$1gvb7$1@dont-email.me> <877co6rwov.fsf@usenet.ankman.de>
Reply-To: MarioCCCP@CCCP.MIR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 1 Oct 2023 21:10:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="4e3e8cc0264650430fd7c9dd67753cfd";
logging-data="2642251"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Ul8U4Hz+4dL2b+cF1nlkY"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:XczF+0Mc+5/5MPber2l/cboNpZI=
In-Reply-To: <877co6rwov.fsf@usenet.ankman.de>
Content-Language: en-GB, it-IT
 by: MarioCCCP - Sun, 1 Oct 2023 21:10 UTC

On 01/10/23 21:04, Andreas Kohlbach wrote:
> On Sun, 1 Oct 2023 12:37:28 +0200, MarioCCCP wrote:
>>
>> I am having to closely look inside uncompressed files .fodt or .html
>> generated by LibreOffice ... and they contain pure garbage, at least
>> in aged, heavily edited, long documents.
>> Substantially, the system seem designed to write a book in a single
>> shot without any editing at all, to produce clean underlying code.
>>
>> you can find tons of rubbish like this
>>
>> <p>Lorem<i>Ipsum</i><i>arma</i>virumque</p>
>>
>> <p><em>Lorem</em><em> </em>Ipsum</p>
>>
>> and every possible conceivable mess of useless tags (formatting in
>> italic just a blank space), closing and immediately reopening the same
>> kind of tag, multiply selecting the same fontface, font family, size,
>> for adjacent or even the same chunk of text.
>> Some shit like <span {text-transform: uppercase}>LOreM IPSum</span>.
>> What the heck are they doing ???
>> Either use a totally SW uppercasing (rendered dynamically at
>> "view-time") or store WHOLLY actual uppercased characters, but not a
>> mix.
>
> That was a lot to read. I didn't read all. But suggest to give "tidy" a
> shot. A tool which attempts to clean up HTML.
>
> https://linux.die.net/man/1/tidy
>
> | Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For
> | HTML variants, it detects and corrects many common coding errors and
> | strives to produce visually equivalent markup that is both W3C
> | compliant and works on most browsers. A common use of Tidy is to
> | convert plain HTML to XHTML. For generic XML files, Tidy is limited to
> | correcting basic well-formedness errors and pretty printing.

I'll be sure to try it ! Thanks a lot, I was getting desperate
over this task. THANK YOU !!!

--
1) Resist, resist, resist.
2) If everyone pays their taxes, the taxes get paid by everyone
MarioCPPP
