Hi internals!
While browsing through bugsnet I encountered this SimpleXML issue with 252 votes: https://bugs.php.net/bug.php?id=54632
TLDR: when you have an XML document (modified a bit from the example in the bugtracker):
<?xml version="1.0" encoding="UTF-8" ?>
<a><b id="foo">foo</b>bar</a>
And you load it into simpleXML, the result of calling json_encode($the_simplexml_object) on that is:
{"b":{"@attributes":{"id":"foo"}}}
There are two strange things here:
- Where is a?
- Where is the text for b (and a)?
What's going on here is that json_encode() gives the JSON representation of what var_dump() gives you.
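For reference, the pieces that are missing from the dump are still reachable through the regular SimpleXML accessors. A minimal sketch using the document from above (the variable name $xml is just for illustration):

```php
<?php
$xml = simplexml_load_string(
    '<?xml version="1.0" encoding="UTF-8" ?><a><b id="foo">foo</b>bar</a>'
);

// The root element's name comes from getName(), it is never an array key.
echo $xml->getName(), "\n";         // a

// Text of <b> via a string cast, its attribute via array-style access.
echo (string) $xml->b, "\n";        // foo
echo (string) $xml->b['id'], "\n";  // foo

// Direct text of <a> via a string cast on the root object.
echo (string) $xml, "\n";           // bar

// json_encode() serialises the same property table that var_dump() shows.
echo json_encode($xml), "\n";
```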
This behaviour is perceived as a bug, given the number of votes and the comment section.
It's possible to change the JSON encoding without affecting var_dump() and the way you access SimpleXML objects.
One comment suggests the following JSON representation for the above XML:
{"a":{"b":{"@attributes":{"id":"foo"},"@text":"foo"},"@text":"bar"}}
This seems reasonable. Let's take a look at how multiple tags are handled right now and how that would work for text nodes.
SimpleXML currently handles multiple tags with the same name by placing them in an array:
Given: <?xml version="1.0" encoding="UTF-8" ?><a><b id="foo"/><x/><y/><x/></a>
You'll get: {"b":{"@attributes":{"id":"foo"}},"x":[{},{}],"y":{}}
We could do the same for text nodes. Given: <?xml version="1.0" encoding="UTF-8" ?><a><b id="foo"/>foo<x/>bar<y/>baz<x/></a>
Could give: {"a":{"b":{"@attributes":{"id":"foo"}},"x":[{},{}],"y":{},"@text":["foo","bar","baz"]}}
This would still not allow reconstructing the document from the JSON, however, as the ordering between tags and text is lost (just as the ordering between different tags is lost now).
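To make the idea concrete, here is a rough userland sketch of such a representation. sketchToObject() is a hypothetical helper, not an existing or proposed API; it assumes ext/dom is loaded (for dom_import_simplexml()), ignores namespaces, and skips whitespace-only text nodes.

```php
<?php
// Hypothetical helper: builds the "@attributes" / "@text" shape discussed above.
function sketchToObject(SimpleXMLElement $el): stdClass
{
    $out = new stdClass();

    // Attributes (no-namespace attributes only in this sketch).
    $attrs = [];
    foreach ($el->attributes() as $name => $value) {
        $attrs[$name] = (string) $value;
    }
    if ($attrs !== []) {
        $out->{'@attributes'} = $attrs;
    }

    // Child elements: a repeated tag name is collected into a list.
    $counts = [];
    foreach ($el->children() as $name => $child) {
        $converted = sketchToObject($child);
        $counts[$name] = ($counts[$name] ?? 0) + 1;
        if ($counts[$name] === 1) {
            $out->{$name} = $converted;
        } elseif ($counts[$name] === 2) {
            $out->{$name} = [$out->{$name}, $converted];
        } else {
            $out->{$name}[] = $converted;
        }
    }

    // Direct text nodes: SimpleXML does not expose them one by one, so go via DOM.
    $texts = [];
    foreach (dom_import_simplexml($el)->childNodes as $node) {
        if ($node instanceof DOMText && trim($node->data) !== '') {
            $texts[] = $node->data;
        }
    }
    if (count($texts) === 1) {
        $out->{'@text'} = $texts[0];
    } elseif (count($texts) > 1) {
        $out->{'@text'} = $texts;
    }

    return $out;
}

$xml = simplexml_load_string('<a><b id="foo"/>foo<x/>bar<y/>baz<x/></a>');
echo json_encode([$xml->getName() => sketchToObject($xml)]), "\n";
// {"a":{"b":{"@attributes":{"id":"foo"}},"x":[{},{}],"y":{},"@text":["foo","bar","baz"]}}
```

Note that the sketch switches between a plain string and a list for "@text" (and between an object and a list for tags), which is exactly the kind of dynamic switching that makes consuming code awkward.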
I'm not sure what the community specifically wants here.
Are there opinions on how this should behave?
Kind regards
Niels
On 14 August 2023 13:40:40 BST, Niels Dossche <dossche.niels@gmail.com> wrote:
> And you load it into simpleXML, the result of calling json_encode($the_simplexml_object)
My usual reaction to this is "why would you take an object designed for
accessing parts of an XML document, and serialise it to JSON?" Often,
the answer turns out to be "because I don't understand SimpleXML
objects, and have copied and pasted a weird hack to get a less useful
array representation by round-tripping to JSON".
On the other hand, the fact that the debug representation of SimpleXML
objects misses out some parts causes a lot of confusion, and I've
actually considered the opposite of what you suggest - leave the JSON
alone, because people will have written production code based on it, but
make the debug array more descriptive of how to use the object.
Either way, the challenge is coming up with something that's concise for
simple structures, but comprehensive for more complex ones, particularly
if you want it to be consistent. For instance:
- Do you assume tag names are unique within a parent, so use key=>value
  directly; or assume they're not, so use key=>[list,of,values]; or
  dynamically switch between the two?
- Do you care about the order of elements with different names, or
  prefer to group by name?
- Do you have any elements with both child tags and text, or attributes
  and text, or all three?
- Do you need to retain the order of text in relation to child elements
  (important for markup languages like HTML or DocBook)? Or is it enough
  to have a representation of "all text content" (the behaviour of
  SimpleXML's string cast)?
- Do you have any elements with namespaces? If so, do you want to use
  local prefixes (and include the xmlns attributes somewhere), or repeat
  the full namespace URI?
There's a reason why both the DOM and SimpleXML provide object-oriented
APIs for accessing the document, not a representation flattened to
native types, and why both APIs are useful for different jobs - XML just
isn't designed for flattening, and different patterns make sense for
different documents / use cases.
Ultimately, I'm not that interested in trying to come up with a JSON or
array representation that covers every possibility, because I think the
only consistent answer would be horribly verbose - basically, describe
every property that DOM would expose on each node.
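As an illustration of how verbose an order-preserving form gets, here is a small sketch that turns every element and text node into its own record. sketchOrdered() and its record keys ("type", "name", "attributes", "children", "value") are made up for this example, and namespaces, comments and processing instructions are ignored.

```php
<?php
// Hypothetical order-preserving dump: one record per element or text node.
function sketchOrdered(DOMNode $node): array
{
    if ($node instanceof DOMText) {
        return ['type' => 'text', 'value' => $node->data];
    }

    $record = [
        'type'       => 'element',
        'name'       => $node->nodeName,
        'attributes' => [],
        'children'   => [],
    ];
    foreach ($node->attributes as $attr) {
        $record['attributes'][$attr->nodeName] = $attr->value;
    }
    // Children are kept in document order, text and elements interleaved.
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement || $child instanceof DOMText) {
            $record['children'][] = sketchOrdered($child);
        }
    }

    return $record;
}

$doc = new DOMDocument();
$doc->loadXML('<a><b id="foo"/>foo<x/>bar</a>');
$record = sketchOrdered($doc->documentElement);
echo json_encode($record, JSON_PRETTY_PRINT), "\n";
```

Even for this four-node document the output runs to a few dozen lines, and it still leaves out namespaces.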
For debug output, the main concern is showing what you'll get with
various styles of access in SimpleXML, so a single "@text" =>
"foobarbaz" would make sense; or maybe even "(string)" => "foobarbaz"
and rename "@attributes" to "->attributes()"
Regards,
--
Rowan Tommins
[IMSoP]
>> And you load it into simpleXML, the result of calling json_encode($the_simplexml_object)
> My usual reaction to this is "why would you take an object designed for accessing parts of an XML document, and serialise it to JSON?" Often, the answer turns out to be "because I don't understand SimpleXML objects, and have copied and pasted a weird hack to get a less useful array representation by round-tripping to JSON".
That's fair, I'm just trying to hear what people want here :-)
> On the other hand, the fact that the debug representation of SimpleXML objects misses out some parts causes a lot of confusion, and I've actually considered the opposite of what you suggest - leave the JSON alone, because people will have written production code based on it, but make the debug array more descriptive of how to use the object.
I agree the debug info is lackluster at the moment.
E.g. if you var_dump a SimpleXML object you don't even see the name or the type, so when you're debugging it becomes a bit more difficult to see what you're working with. Similarly, a getType() method would be handy to verify what's actually "behind" the SimpleXML object.
(...)
> Ultimately, I'm not that interested in trying to come up with a JSON or array representation that covers every possibility, because I think the only consistent answer would be horribly verbose - basically, describe every property that DOM would expose on each node.
Probably yeah, I tried to come up with something but it was indeed verbose.
> For debug output, the main concern is showing what you'll get with various styles of access in SimpleXML, so a single "@text" => "foobarbaz" would make sense; or maybe even "(string)" => "foobarbaz" and rename "@attributes" to "->attributes()"
Having "->method()" might look a bit weird, at least I haven't seen something like this in other PHP extensions. Although it is more descriptive than "@attributes", because people may be led to believe you can do $simpleXML->{"@attributes"} or something like that. Adding a name and type in the output is also necessary I think.
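A short sketch of that pitfall, using the example document from the start of the thread: "@attributes" is only a label in the dump, so using it as a property name is treated as a lookup for a child element with that name.

```php
<?php
$xml = simplexml_load_string('<a><b id="foo">foo</b>bar</a>');

// Looked up as a child element called "@attributes", which does not exist.
var_dump(isset($xml->b->{'@attributes'})); // bool(false)

// The supported routes are attributes() or array-style access.
foreach ($xml->b->attributes() as $name => $value) {
    echo $name, '=', $value, "\n";         // id=foo
}
echo $xml->b['id'], "\n";                  // foo
```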
Making such changes probably requires an RFC and some bikeshedding to make sure everyone agrees on it.
Kind regards