SimpleXML and XML namespaces

20 years ago by Rasmus Lerdorf

I don't really understand how the current xml namespace handling in
simplexml is useful.

test.xml:

<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
  <node>
    <title>Title1</title>
    <title>Title2</title>
    <media:title>Media Title</media:title>
  </node>
</rss>

$xml = simplexml_load_file('test.xml');
$xml->node ends up containing a title array that looks like this:

["title"]=>
array(3) {
[0] => string(6) "Title1"
[1] => string(6) "Title2"
[2] => string(11) "Media Title"
}

Of course, I can loop through
$xml->node->children('http://search.yahoo.com/mrss')
and get
["title"]=>
object(SimpleXMLElement)#8 (1) {
[0] => string(11) "Media Title"
}

But how does this really help? I don't see how it is possible to
distinguish the namespaced title vs. the non-namespaced ones. My
suggestion here would be that for namespaced nodes the namespace alias
(or perhaps the actual namespace?) becomes the key in the nodes array.
As in:

["title"]=>
array(3) {
[0] => string(6) "Title1"
[1] => string(6) "Title2"
['media'] => string(11) "Media Title"
}

So people have a shot at distinguishing media:title from <title>.

Or, alternatively, have separate arrays:

["title"]=>
array(2) {
[0] => string(6) "Title1"
[1] => string(6) "Title2"
}
["media:title"]=>string(11) "Media Title"

The latter is actually what I was (naively) expecting.

-Rasmus

20 years ago by George Schlossnagle

Rasmus Lerdorf wrote:

Or, alternatively, have separate arrays:

["title"]=>
array(2) {
[0] => string(6) "Title1"
[1] => string(6) "Title2"
}
["media:title"]=>string(11) "Media Title"

The latter is actually what I was (naively) expecting.

I think this latter one is bad. The local prefix is local, and I
think it's broken to have it show up in a parsing of the document. I
agree the current namespace handling is really frustrating, but I
don't think this is a good solution.

George

20 years ago by Sterling Hughes

Hm - that shouldn't be.

I think the right solution is that media:title should not show up in
the children of node unless you are looking at the proper namespace,
i.e., you need to use children() to get the children in that namespace.

-Sterling
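
A minimal sketch of the children() approach described above, assuming the test.xml document from the first message is inlined as a string. The variable names are illustrative and not from the thread:

```php
<?php
// Same document as test.xml in the first message, inlined so the sketch is self-contained.
$xml = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
  <node>
    <title>Title1</title>
    <title>Title2</title>
    <media:title>Media Title</media:title>
  </node>
</rss>
XML;

$rss = simplexml_load_string($xml);

// Plain property access only walks the non-namespaced <title> elements.
$plainTitles = [];
foreach ($rss->node->title as $title) {
    $plainTitles[] = (string) $title;
}

// children() with the namespace URI walks the elements in that namespace;
// the iteration key is the local element name.
$mediaTitles = [];
foreach ($rss->node->children('http://search.yahoo.com/mrss') as $name => $element) {
    $mediaTitles[] = $name . ': ' . (string) $element;
}

print_r($plainTitles);
print_r($mediaTitles);
```

Keeping the two loops separate mirrors the point made here: the namespaced element is reached only through its namespace URI, never through the plain title property.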

20 years ago by Rasmus Lerdorf

Sterling Hughes wrote:

Hm - that shouldn't be.

I think the right solution is that media:title should not show up in
the children of node unless you are looking at the proper namespace,
i.e., you need to use children() to get the children in that namespace.

Ah, you are right. It's the damn var_dump() problem again.
For example:

$xml = <<<EOF
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
  <node>
    <title>Title</title>
    <media:title>Media Title</media:title>
  </node>
</rss>
EOF;
$x = simplexml_load_string($xml);

var_dump($x->node) shows:

object(SimpleXMLElement)#2 (1) {
["title"]=>
array(2) {
[0]=>
string(6) "Title"
[1]=>
string(11) "Media Title"
}
}

but var_dump($x->node->title) shows:

object(SimpleXMLElement)#4 (1) {
[0]=>
string(6) "Title"
}

There should be a simplexml_dump() or something along those lines and
perhaps even a warning in var_dump() when it tries to dump an object
that has its own iterator.

-Rasmus

20 years ago by john

Long time reader, first time poster.

Rasmus, I noticed your var_dump says $x->node->title is of string(6) ...
though I count only 5. Just wondering, a simple typo or something more
involved?

Regards,
John

20 years ago by Rasmus Lerdorf

john wrote:

Long time reader, first time poster.

Rasmus, I noticed your var_dump says $x->node->title is of string(6) ...
though I count only 5. Just wondering, a simple typo or something more
involved?

That was just me munging the output a bit. It gets the right length.

-Rasmus

20 years ago by Adam Maccabee Trachtenberg

Rasmus Lerdorf wrote:

But how does this really help? I don't see how it is possible to
distinguish the namespaced title vs. the non-namespaced ones. My
suggestion here would be that for namespaced nodes the namespace alias
(or perhaps the actual namespace?) becomes the key in the nodes array.

XML Namespaces are a real PITA. I remember Rob, Sterling, and I went
through a variety of iterations around this.

The biggest problem is that prefixes are really not something you can
rely on at all -- they are just a handy fiction -- the namespace name
is really what the XML processor uses.

If you're consuming a feed and the provider alters the namespace
prefix, but binds it to the same namespace, then the document is
considered identical. However, if you're relying on a specific prefix
in your code (instead of the actual namespace), then your code is
busted.

Since people don't always have control over the process that produces
the XML documents, it doesn't seem reasonable to expect them to stop
others from changing prefixes.

Second, default namespaces also screw things up entirely, as you have
no way to access <foo xmlns="a">. It's different from <foo>, so they
shouldn't be lumped together, but there's no prefix you can use to
access it. Now you have to have a way of registering prefixes, so you
can access elements in default namespaces.

FWIW, this exact problem is the #1 XSLT FAQ because people don't
realize that elements in a default namespace aren't the same as
non-namespaced elements.

(Of course there is the issue of what happens when something switches
from having a prefix to being in a default namespace -- again it is
the identical document, but code is broken.)

Last, you can get weird rebinding of namespace prefixes:

<foo xmlns:a="x"> <a:bar/> <a:bar xmlns:a="y"/> </foo>

These two <a:bar>s are different.

Ultimately, for those reasons, if you want to reliably access an XML
document using namespace prefixes, you really need to register your
own prefixes for every namespace used in the document and use those in
your code, or things could potentially break even under a valid XML
document.

It was really those two issues that caused us (I think it was largely
Rob) to suggest we end up using children() and attributes() with the
namespace name instead of the prefix.

I really do think it is the cleanest solution that doesn't break down
when you reach the edge cases.

-adam

--
adam@trachtenberg.com | http://www.trachtenberg.com
author of o'reilly's "upgrading to php 5" and "php cookbook"
avoid the holiday rush, buy your copies today!
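
A rough sketch of the prefix problem described above, assuming three hypothetical feeds that bind the same namespace name to different prefixes (or to a default namespace on the element). The prefix "my" and all variable names are made up for illustration. Both children() with the namespace name and an XPath query using a locally registered prefix find the element in every variant:

```php
<?php
const MRSS = 'http://search.yahoo.com/mrss';

// Three spellings of the same document: only the prefix binding differs.
$feeds = [
    'prefix media' => '<rss xmlns:media="http://search.yahoo.com/mrss"><node>'
        . '<title>Plain</title><media:title>Media Title</media:title></node></rss>',
    'prefix m' => '<rss xmlns:m="http://search.yahoo.com/mrss"><node>'
        . '<title>Plain</title><m:title>Media Title</m:title></node></rss>',
    'default namespace' => '<rss><node><title>Plain</title>'
        . '<title xmlns="http://search.yahoo.com/mrss">Media Title</title></node></rss>',
];

$results = [];
foreach ($feeds as $label => $xml) {
    $node = simplexml_load_string($xml)->node;

    // Access by namespace name: independent of whatever prefix the feed chose.
    $byName = (string) $node->children(MRSS)->title;

    // Register our own prefix on the object we query, then use it in XPath.
    $node->registerXPathNamespace('my', MRSS);
    $byXpath = (string) $node->xpath('my:title')[0];

    $results[$label] = [$byName, $byXpath];
}

print_r($results);
```

Neither lookup mentions "media" or "m", so a provider changing its prefix would not affect this code.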

20 years ago by Rasmus Lerdorf

Adam Maccabee Trachtenberg wrote:

Last, you can get weird rebinding of namespace prefixes:

<foo xmlns:a="x"> <a:bar/> <a:bar xmlns:a="y"/> </foo>

These two <a:bar>s are different.

Ultimately, for those reasons, if you want to reliably access an XML
document using namespace prefixes, you really need to register your
own prefixes for every namespace used in the document and use those in
your code, or things could potentially break even under a valid XML
document.

It was really those two issues that caused us (I think it was largely
Rob) to suggest we end up using children() and attributes() with the
namespace name instead of the prefix.

I really do think it is the cleanest solution that doesn't break down
when you reach the edge cases.

Yeah, I agree actually. My real beef is that simplexml and var_dump()
don't play nicely with each other. var_dump() ends up lumping the
namespaced elements in with the non-namespaced elements of the same
name, but when you iterate through things manually they are not lumped
together and the only way to get at the namespaced elements is by
checking for them directly with the appropriate children() call.

I am fine with having to manually dereference the namespace and keeping
things completely separate. I'd just like it to be easier for people to
use var_dump() on a simplexml object and not have it confuse the heck
out of them by showing them arrays with 2 elements in them which yield
only 1 when they iterate over them or call count() on them.

-Rasmus
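
A small sketch of the iteration and count() behaviour described above, using the two-title document from the earlier var_dump() example. The constant and variable names are illustrative. It deliberately makes no claim about what var_dump() prints:

```php
<?php
const MEDIA_NS = 'http://search.yahoo.com/mrss';

$xml = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
  <node>
    <title>Title</title>
    <media:title>Media Title</media:title>
  </node>
</rss>
XML;

$x = simplexml_load_string($xml);

// Only the non-namespaced <title> is counted and iterated here.
$plainCount = count($x->node->title);
$iterated = [];
foreach ($x->node->title as $title) {
    $iterated[] = (string) $title;
}

// The namespaced one has to be asked for through its namespace.
$mediaCount = count($x->node->children(MEDIA_NS)->title);

echo "plain: $plainCount, media: $mediaCount\n";
```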

20 years ago by Adam Maccabee Trachtenberg

Rasmus Lerdorf wrote:

Yeah, I agree actually. My real beef is that simplexml and var_dump()
don't play nicely with each other. var_dump() ends up lumping the
namespaced elements in with the non-namespaced elements of the same
name, but when you iterate through things manually they are not lumped
together and the only way to get at the namespaced elements is by
checking for them directly with the appropriate children() call.

I am fine with having to manually dereference the namespace and keeping
things completely separate. I'd just like it to be easier for people to
use var_dump() on a simplexml object and not have it confuse the heck
out of them by showing them arrays with 2 elements in them which yield
only 1 when they iterate over them or call count() on them.

I totally agree with this. In fact, I'd say this is a specific
instance of a more general problem -- overloaded objects just don't
play nicely with var_dump() and friends.

Overloading is cool, but it makes it much harder to debug objects
because there's no standard way to introspect them to see how things
"really" are under the covers, and when you do use var_dump(), you may
get lied to in all sorts of unpredictable ways and not even know it,
as there's no way to know a priori that an object is overloaded.

-adam

20 years ago by Rob Richards

Rasmus Lerdorf wrote:

Yeah, I agree actually. My real beef is that simplexml and var_dump()
don't play nicely with each other. var_dump() ends up lumping the
namespaced elements in with the non-namespaced elements of the same
name, but when you iterate through things manually they are not lumped
together and the only way to get at the namespaced elements is by
checking for them directly with the appropriate children() call.

I am fine with having to manually dereference the namespace and keeping
things completely separate. I'd just like it to be easier for people to
use var_dump() on a simplexml object and not have it confuse the heck
out of them by showing them arrays with 2 elements in them which yield
only 1 when they iterate over them or call count() on them.

It doesn't look difficult to make var_dump respect the namespace set by
the initial sxe object for subobjects. If it were to be changed, I would
also suggest not returning non-element nodes. Right now PIs, comments,
etc. get returned by var_dump, but these aren't considered sxe
properties either.

Rob

20 years ago by Sterling Hughes

I agree. var_dump() should accurately expose the structure of the
simplexml object; if people want to see everything, they should dump
it explicitly (is there a method in the DOM API to do this?)

-Sterling

20 years ago by Adam Maccabee Trachtenberg

Sterling Hughes wrote:

I agree. var_dump() should accurately expose the structure of the
simplexml object; if people want to see everything, they should dump
it explicitly (is there a method in the DOM API to do this?)

You mean other than reserializing the data back as XML? :)

All kidding aside, I do use SXE's asXML() and DOM's saveXML() for
debugging in these types of cases.

-adam
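
A minimal sketch of that debugging approach, assuming the same two-title document used earlier in the thread. The variable names are illustrative. It serializes the subtree back to XML with SimpleXMLElement::asXML() and, after dom_import_simplexml(), with DOMDocument::saveXML():

```php
<?php
$xml = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
  <node>
    <title>Title</title>
    <media:title>Media Title</media:title>
  </node>
</rss>
XML;

$x = simplexml_load_string($xml);

// SimpleXML: serialize just the <node> subtree.
$viaSimpleXml = $x->node->asXML();

// DOM: import the same element and ask its document to serialize that node.
$domNode = dom_import_simplexml($x->node);
$viaDom = $domNode->ownerDocument->saveXML($domNode);

echo $viaSimpleXml, "\n";
echo $viaDom, "\n";
```

Both strings show the prefixed element as it appears in the document, which sidesteps the question of how var_dump() groups children.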

20 years ago by Rob Richards

There isn't a single method in DOM for this; you have to write code to
do it. get_properties was not implemented in DOM because there are too
many properties and many of them are recursive (DOM both ascends and
descends a tree). Any debugging would be useless trying to sort through
all the crap. I, like Adam, also use those methods when I need to
examine a subtree.

Rob

20 years ago by Marcus Boerger

Hello Rob,

What we need here is a temp hash table to store the names. var_dump would
grab them through get_properties...

Friday, August 19, 2005, 9:34:54 PM, you wrote:

There isn't a single method in DOM for this; you have to write code to
do it. get_properties was not implemented in DOM because there are too
many properties and many of them are recursive (DOM both ascends and
descends a tree). Any debugging would be useless trying to sort through
all the crap. I, like Adam, also use those methods when I need to
examine a subtree.

Rob

Best regards,
Marcus