RetroBBS - comp.lang.python - Re: xml.etree and namespaces -- why?

Hi all,

For the impatient: Below the longish text is a fully self-contained Python
example that illustrates my problem.

I'm struggling to understand xml.etree's handling of namespaces. I'm trying to
parse an Inkscape document which uses several namespaces. From etree's
documentation:

If the XML input has namespaces, tags and attributes with prefixes in the
form prefix:sometag get expanded to {uri}sometag where the prefix is
replaced by the full URI.

Which means that given an Element e, I cannot directly access its attributes
using e.get() because in order to do that I need to know the URI of the
namespace. So rather than doing this (see example below):

label = e.get('inkscape:label')

I need to do this:

label = e.get('{' + uri_inkscape_namespace + '}label')

...which is the method mentioned in etree's docs:

One way to search and explore this XML example is to manually add the URI
to every tag or attribute in the xpath of a find() or findall().
[...]
A better way to search the namespaced XML example is to create a
dictionary with your own prefixes and use those in the search functions.

Good idea! Better yet, that dictionary or rather, its reverse, already exists,
because etree has used it to unnecessarily mangle the namespaces in the first
place. The documentation doesn't mention where it can be found, but we can
just use the 'xmlns:' attributes of the <svg> root element to rebuild it. Or
so I thought, until I found out that etree deletes exactly these attributes
before handing the <svg> element to the user.

I'm really stumped here. Apart from the fact that I think XML is bloated shit
anyway and has no place outside HTML, I just don't get the purpose of etree's
way of working:

1) Evaluate 'xmlns:' attributes of the <svg> element
2) Use that info to replace the existing prefixes by {uri}
3) Realizing that using {uri} prefixes is cumbersome, suggest to
   the user to build their own prefix -> uri dictionary
   to undo the effort of doing 1) and 2)
4) ...but withholding exactly the information that existed in the original
   document by deleting the 'xmlns:' attributes from the <svg> tag

Why didn't they leave the whole damn thing alone? Keep <svg> intact and keep
the attribute 'prefix:key' literally as they are. For anyone wanting to use
the {uri} prefixes (why would they) they could have thrown in a helper
function for the prefix->URI translation.

I'm assuming that etree's designers knew what they were doing in order to make
my life easier when dealing with XML. Maybe I'm missing the forest for the
trees. Can anybody enlighten me? Thanks!

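#### self-contained example
import xml.etree.ElementTree as ET

def test_svg(xml):
    root = ET.fromstring(xml)
    for e in root.iter():
        print(e.tag) # tags are shown prefixed with {URI}
        if e.tag.endswith('svg'):
            # Since namespaces are defined inside the <svg> tag, let's use the info
            # from the 'xmlns:' attributes to undo etree's URI prefixing
            print('Element <svg>:')
            for k, v in e.items():
                print(' %s: %s' % (k, v))
            # ...but alas: the 'xmlns:' attributes have been deleted by the parser

xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<svg
   width="210mm"
   height="297mm"
   viewBox="0 0 210 297"
   version="1.1"
   id="svg285"
   inkscape:version="1.2.1 (9c6d41e410, 2022-07-14)"
   sodipodi:docname="test.svg"
   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
   xmlns="http://www.w3.org/2000/svg"
   xmlns:svg="http://www.w3.org/2000/svg">
  <sodipodi:namedview
     id="namedview287"
     pagecolor="#ffffff"
     bordercolor="#000000"
     borderopacity="0.25"
     inkscape:showpageshadow="2"
     inkscape:pageopacity="0.0"
     inkscape:pagecheckerboard="0"
     inkscape:deskcolor="#d1d1d1"
     inkscape:document-units="mm"
     showgrid="false"
     inkscape:zoom="0.2102413"
     inkscape:cx="394.78447"
     inkscape:cy="561.25984"
     inkscape:window-width="1827"
     inkscape:window-height="1177"
     inkscape:window-x="85"
     inkscape:window-y="-8"
     inkscape:window-maximized="1"
     inkscape:current-layer="layer1" />
  <defs
     id="defs282" />
  <g
     inkscape:label="Ebene 1"
     inkscape:groupmode="layer"
     id="layer1">
    <rect
       style="fill:#aaccff;stroke-width:0.264583"
       id="rect289"
       width="61.665253"
       height="54.114403"
       x="33.978813"
       y="94.38559" />
  </g>
</svg>
'''

if __name__ == '__main__':
    test_svg(xml)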
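
The declarations are consumed by the parser, but they are not out of reach: ET.iterparse() can report them through its 'start-ns' event, which yields a (prefix, uri) pair for each 'xmlns' declaration as it is encountered. A minimal sketch, assuming the whole document fits in memory; collect_namespaces and the shortened sample document are made up for illustration and are not part of the example above:

```python
import io
import xml.etree.ElementTree as ET


def collect_namespaces(xml_text):
    """Parse xml_text and return (root, {prefix: uri}).

    Hypothetical helper. If a prefix is declared more than once,
    the last declaration wins.
    """
    namespaces = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload      # the default namespace has prefix ''
            namespaces[prefix] = uri
        elif root is None:
            root = payload             # the first "start" event is the root
    return root, namespaces


# Shortened stand-in for an Inkscape file (assumption, not the full document).
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">'
    '<g inkscape:label="Ebene 1" id="layer1"/>'
    '</svg>'
)

root, ns = collect_namespaces(sample)
for prefix, uri in ns.items():
    print('%r -> %s' % (prefix, uri))
```

Note that this only reveals which prefixes this particular file happens to use. Another file may bind different prefixes to the same URIs, so the URI remains the thing to key on.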

Re: xml.etree and namespaces -- why?

by: Axy - Wed, 19 Oct 2022 15:23 UTC

I mean, it's worth looking at the BeautifulSoup source to see how they do that.
With BS I work with attributes exactly as you want, and I explicitly
tell BS to use the lxml parser.

Axy.

On 19/10/2022 14:25, Robert Latest via Python-list wrote:
> Hi all,
>
> For the impatient: Below the longish text is a fully self-contained Python
> example that illustrates my problem.
>
> I'm struggling to understand xml.etree's handling of namespaces. I'm trying to
> parse an Inkscape document which uses several namespaces. From etree's
> documentation:
>
> If the XML input has namespaces, tags and attributes with prefixes in the
> form prefix:sometag get expanded to {uri}sometag where the prefix is
> replaced by the full URI.
>
> Which means that given an Element e, I cannot directly access its attributes
> using e.get() because in order to do that I need to know the URI of the
> namespace. So rather than doing this (see example below):
>
> label = e.get('inkscape:label')
>
> I need to do this:
>
> label = e.get('{' + uri_inkscape_namespace + '}label')
>
> ...which is the method mentioned in etree's docs:
>
> One way to search and explore this XML example is to manually add the URI
> to every tag or attribute in the xpath of a find() or findall().
> [...]
> A better way to search the namespaced XML example is to create a
> dictionary with your own prefixes and use those in the search functions.
>
> Good idea! Better yet, that dictionary or rather, its reverse, already exists,
> because etree has used it to unnecessarily mangle the namespaces in the first
> place. The documentation doesn't mention where it can be found, but we can
> just use the 'xmlns:' attributes of the <svg> root element to rebuild it. Or
> so I thought, until I found out that etree deletes exactly these attributes
> before handing the <svg> element to the user.
>
> I'm really stumped here. Apart from the fact that I think XML is bloated shit
> anyway and has no place outside HTML, I just don't get the purpose of etree's
> way of working:
>
> 1) Evaluate 'xmlns:' attributes of the <svg> element
> 2) Use that info to replace the existing prefixes by {uri}
> 3) Realizing that using {uri} prefixes is cumbersome, suggest to
>    the user to build their own prefix -> uri dictionary
>    to undo the effort of doing 1) and 2)
> 4) ...but withholding exactly the information that existed in the original
>    document by deleting the 'xmlns:' attributes from the <svg> tag
>
> Why didn't they leave the whole damn thing alone? Keep <svg> intact and keep
> the attribute 'prefix:key' literally as they are. For anyone wanting to use
> the {uri} prefixes (why would they) they could have thrown in a helper
> function for the prefix->URI translation.
>
> I'm assuming that etree's designers knew what they were doing in order to make
> my life easier when dealing with XML. Maybe I'm missing the forest for the
> trees. Can anybody enlighten me? Thanks!
>
>
> #### self-contained example
> import xml.etree.ElementTree as ET
>
> def test_svg(xml):
>     root = ET.fromstring(xml)
>     for e in root.iter():
>         print(e.tag) # tags are shown prefixed with {URI}
>         if e.tag.endswith('svg'):
>             # Since namespaces are defined inside the <svg> tag, let's use the info
>             # from the 'xmlns:' attributes to undo etree's URI prefixing
>             print('Element <svg>:')
>             for k, v in e.items():
>                 print(' %s: %s' % (k, v))
>             # ...but alas: the 'xmlns:' attributes have been deleted by the parser
>
> xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
> 
>
> <svg
> width="210mm"
> height="297mm"
> viewBox="0 0 210 297"
> version="1.1"
> id="svg285"
> inkscape:version="1.2.1 (9c6d41e410, 2022-07-14)"
> sodipodi:docname="test.svg"
> xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
> xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
> xmlns="http://www.w3.org/2000/svg"
> xmlns:svg="http://www.w3.org/2000/svg">
> <sodipodi:namedview
> id="namedview287"
> pagecolor="#ffffff"
> bordercolor="#000000"
> borderopacity="0.25"
> inkscape:showpageshadow="2"
> inkscape:pageopacity="0.0"
> inkscape:pagecheckerboard="0"
> inkscape:deskcolor="#d1d1d1"
> inkscape:document-units="mm"
> showgrid="false"
> inkscape:zoom="0.2102413"
> inkscape:cx="394.78447"
> inkscape:cy="561.25984"
> inkscape:window-width="1827"
> inkscape:window-height="1177"
> inkscape:window-x="85"
> inkscape:window-y="-8"
> inkscape:window-maximized="1"
> inkscape:current-layer="layer1" />
> <defs
> id="defs282" />
> <g
> inkscape:label="Ebene 1"
> inkscape:groupmode="layer"
> id="layer1">
> <rect
> style="fill:#aaccff;stroke-width:0.264583"
> id="rect289"
> width="61.665253"
> height="54.114403"
> x="33.978813"
> y="94.38559" />
> </g>
> </svg>
> '''
>
> if __name__ == '__main__':
>     test_svg(xml)

Re: xml.etree and namespaces -- why?

by: Jon Ribbens - Wed, 19 Oct 2022 16:15 UTC

On 2022-10-19, Robert Latest <boblatest@yahoo.com> wrote:
> If the XML input has namespaces, tags and attributes with prefixes
> in the form prefix:sometag get expanded to {uri}sometag where the
> prefix is replaced by the full URI.
>
> Which means that given an Element e, I cannot directly access its attributes
> using e.get() because in order to do that I need to know the URI of the
> namespace.

That's because you *always* need to know the URI of the namespace,
because that's its only meaningful identifier. If you assume that a
particular namespace always uses the same prefix then your code will be
completely broken. The following two pieces of XML should be understood
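identically:

  <svg xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">
    <g inkscape:label="Ebene 1" inkscape:groupmode="layer" id="layer1">

and:

  <svg xmlns:epacskni="http://www.inkscape.org/namespaces/inkscape">
    <g epacskni:label="Ebene 1" epacskni:groupmode="layer" id="layer1">
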
So you can see why e.get('inkscape:label') cannot possibly work, and why
e.get('{http://www.inkscape.org/namespaces/inkscape}label') makes sense.
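
To make that concrete, here is a small sketch with two documents that differ only in the prefix bound to the Inkscape namespace URI. The closing tags and the names DOC_A, DOC_B and layer_attributes are additions for illustration, so that the fragments parse on their own:

```python
import xml.etree.ElementTree as ET

INKSCAPE_URI = "http://www.inkscape.org/namespaces/inkscape"

# Same content, different prefix bound to the same namespace URI.
# Closing tags are added here so that both snippets are well-formed.
DOC_A = (
    '<svg xmlns:inkscape="%s">'
    '<g inkscape:label="Ebene 1" inkscape:groupmode="layer" id="layer1"/>'
    '</svg>' % INKSCAPE_URI
)
DOC_B = (
    '<svg xmlns:epacskni="%s">'
    '<g epacskni:label="Ebene 1" epacskni:groupmode="layer" id="layer1"/>'
    '</svg>' % INKSCAPE_URI
)


def layer_attributes(xml_text):
    """Return the attributes of the first <g> child as ElementTree sees them."""
    return dict(ET.fromstring(xml_text).find("g").attrib)


attrs_a = layer_attributes(DOC_A)
attrs_b = layer_attributes(DOC_B)

# Both documents yield the same keys, all in {uri}local form.
print(attrs_a == attrs_b)
print(attrs_a.get("{%s}label" % INKSCAPE_URI))
```

The prefix never shows up in the parsed attribute names, which is why a lookup by 'inkscape:label' has nothing to match against.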

The xml.etree author obviously knew that this was cumbersome, and
hence you can do something like:

namespaces = {'inkspace': 'http://www.inkscape.org/namespaces/inkscape'}
element = root.find('inkspace:foo', namespaces)

which will work for both of the above pieces of XML.

But unfortunately as far as I can see nobody's thought about doing the
same for attributes rather than tags.
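
A thin wrapper gets close to that for attributes. This is only a sketch: get_attr is a made-up helper, not something xml.etree provides, and the prefix map is the same kind of caller-supplied dictionary that find() accepts:

```python
import xml.etree.ElementTree as ET

# Prefix chosen by the calling code, independent of the document's prefixes.
NAMESPACES = {"ink": "http://www.inkscape.org/namespaces/inkscape"}


def get_attr(element, name, namespaces, default=None):
    """Look up an attribute by 'prefix:local' name using a prefix map.

    Hypothetical helper, not part of xml.etree: it expands the prefix to
    ElementTree's '{uri}local' form and then calls element.get().
    An unknown prefix raises KeyError.
    """
    prefix, sep, local = name.partition(":")
    if not sep:
        return element.get(name, default)  # unprefixed attribute
    return element.get("{%s}%s" % (namespaces[prefix], local), default)


root = ET.fromstring(
    '<svg xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">'
    '<g inkscape:label="Ebene 1" id="layer1"/>'
    '</svg>'
)
g = root.find("g")
label = get_attr(g, "ink:label", NAMESPACES)
print(label)
```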

Re: xml.etree and namespaces -- why?

by: Axy - Wed, 19 Oct 2022 14:53 UTC

I have no idea why. I used to remove namespaces, following the advice
from Stack Overflow:

https://stackoverflow.com/questions/4255277/lxml-etree-xmlparser-remove-unwanted-namespace

from lxml import etree

# XSLT that copies the document but rebuilds every element and attribute
# under its local name only, which drops all namespaces
_ns_removal_xslt_transform = etree.XSLT(etree.fromstring('''
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="no"/>

    <xsl:template match="/|comment()|processing-instruction()">
        <xsl:copy>
          <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*">
        <xsl:element name="{local-name()}">
          <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*">
        <xsl:attribute name="{local-name()}">
          <xsl:value-of select="."/>
        </xsl:attribute>
    </xsl:template>
</xsl:stylesheet>
'''))

# my_xml_data: the XML document to strip namespaces from
xml_doc = _ns_removal_xslt_transform(etree.fromstring(my_xml_data))
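
Roughly the same effect can be had with the standard library alone, without lxml or XSLT. A minimal sketch, assuming it is acceptable to modify the tree in place; strip_namespaces is a made-up name, and names from different namespaces that share a local name will collide:

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Remove the '{uri}' part from every tag and attribute name, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
        for key in list(elem.attrib):
            if key.startswith("{"):
                # A clash between local names silently overwrites the
                # value stored under that name earlier.
                elem.attrib[key.split("}", 1)[1]] = elem.attrib.pop(key)
    return root


root = strip_namespaces(ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">'
    '<g inkscape:label="Ebene 1" id="layer1"/>'
    '</svg>'
))
print(root.tag, root.find("g").attrib)
```

This throws away exactly the information that tells an Inkscape 'label' apart from any other 'label', so it is only safe when the document is known not to reuse local names across namespaces.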

Later on, when I worked with SVG, I used BeautifulSoup.

Axy.

Re: xml.etree and namespaces -- why?

by: Robert Latest

Jon Ribbens wrote:
> That's because you *always* need to know the URI of the namespace,
> because that's its only meaningful identifier. If you assume that a
> particular namespace always uses the same prefix then your code will be
> completely broken. The following two pieces of XML should be understood
> identically:
>
> <svg xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">
> <g inkscape:label="Ebene 1" inkscape:groupmode="layer" id="layer1">
>
> and:
>
> <svg xmlns:epacskni="http://www.inkscape.org/namespaces/inkscape">
> <g epacskni:label="Ebene 1" epacskni:groupmode="layer" id="layer1">
>
> So you can see why e.get('inkscape:label') cannot possibly work, and why
> e.get('{http://www.inkscape.org/namespaces/inkscape}label') makes sense.

I get it. It does.

> The xml.etree author obviously knew that this was cumbersome, and
> hence you can do something like:
>
> namespaces = {'inkspace': 'http://www.inkscape.org/namespaces/inkscape'}
> element = root.find('inkspace:foo', namespaces)
>
> which will work for both of the above pieces of XML.

Makes sense. It forces me to make up my own prefixes which I can then safely
use in my code rather than relying on the XML's generator not to change its
prefixes.
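
In code that might look like the sketch below. The prefixes 'svg' and 'ink' are picked by the calling code, and the two sample documents (one using a default namespace, one using short prefixes) are invented to stand in for output from different generators:

```python
import xml.etree.ElementTree as ET

# Prefixes chosen by the calling code; only the URIs must match the document.
NS = {
    "svg": "http://www.w3.org/2000/svg",
    "ink": "http://www.inkscape.org/namespaces/inkscape",
}

# Hypothetical output of two generators that picked different prefixes.
DOC_DEFAULT_NS = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">'
    '<g inkscape:label="Ebene 1" id="layer1"><rect id="rect289"/></g>'
    '</svg>'
)
DOC_PREFIXED = (
    '<s:svg xmlns:s="http://www.w3.org/2000/svg" '
    'xmlns:i="http://www.inkscape.org/namespaces/inkscape">'
    '<s:g i:label="Ebene 1" id="layer1"><s:rect id="rect289"/></s:g>'
    '</s:svg>'
)


def rect_ids(xml_text):
    """Return the ids of all <rect> elements inside top-level <g> elements."""
    root = ET.fromstring(xml_text)
    return [rect.get("id") for rect in root.findall("svg:g/svg:rect", NS)]


print(rect_ids(DOC_DEFAULT_NS))
print(rect_ids(DOC_PREFIXED))
```

The same search path works on both documents because the lookup is resolved through the URIs in NS, never through whatever prefixes the file itself declares.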

BTW, I only now thought to look at what actually is at Inkscape's namespace
URI, and it turns out to be quite a nice explanation of what a namespace is and
why it looks like a URL.
