Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:7050
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
In-Reply-To: <00fe01c3d9d1$6e8618a0$f7dea8c0@cyberware.local>
References: <Pine.LNX.4.58.0401130248410.21920@miranda.org> <00fe01c3d9d1$6e8618a0$f7dea8c0@cyberware.local>
Mime-Version: 1.0 (Apple Message framework v609)
Content-Type: text/plain; charset=US-ASCII; format=flowed
Message-ID: <F456CE88-45DA-11D8-B3DB-003065FB1208@trachtenberg.com>
Content-Transfer-Encoding: 7bit
Cc: <internals@lists.php.net>
Date: Tue, 13 Jan 2004 10:12:51 -0500
To: "Rob Richards" <rrichards@ctindustries.net>
Subject: Re: [PHP-DEV] SimpleXML: Moving Forward
From: adam@trachtenberg.com (Adam Trachtenberg)

On Jan 13, 2004, at 7:33 AM, Rob Richards wrote:

> From: Adam Maccabee Trachtenberg
>
>> 1) SimpleXML creates PHP data structures from XML documents. It only
>>    handles XML elements, attributes, and text nodes. The syntax for
>>    accessing the text node children of an element is akin to object
>>    properties ($foo->bar); the syntax of accessing attributes is akin
>>    to array elements ($foo['bar']).
>
> This goes back to my question on what is the goal of SimpleXML?
> Is it supposed to be an easy api to be able to access any xml document
> or only not complex ones?

Here's where I see the benefit of SimpleXML. SimpleXML should be used 
when you know the schema of an XML document and want to extract 
specific pieces of data from it. My favorite use-cases are: RSS, REST, 
and configuration files.

This doesn't mean there's necessarily a formal XML Schema or RelaxNG 
document, but that the developer is familiar enough with the layout of 
the XML document that she knows what she's looking for and can 
formulate code to access the information she wants.

In most cases, this will mean accessing text nodes directly via 
$foo->bar->baz. More complex cases will be handled using XPath: 
/rss:foo[starts-with(dc:bar, '2004-01')]/rss:baz.
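
A minimal sketch of both styles. The documents and namespace URIs below 
are made up, and the method names (registerXPathNamespace(), xpath()) 
are the ones SimpleXML exposes in current PHP, not anything settled in 
this thread:

```php
<?php
// Simple case: property access on a document whose layout is known.
$foo = simplexml_load_string('<foo><bar><baz>hello</baz></bar></foo>');
echo $foo->bar->baz, "\n";

// More complex case: XPath over a namespaced document.
// Element names and namespace URIs are hypothetical.
$doc = simplexml_load_string(
    '<rss:foo xmlns:rss="urn:example:rss" xmlns:dc="urn:example:dc">'
    . '<dc:bar>2004-01-13</dc:bar>'
    . '<rss:baz>Moving Forward</rss:baz>'
    . '</rss:foo>'
);

// Register the prefixes used in the expression explicitly.
$doc->registerXPathNamespace('rss', 'urn:example:rss');
$doc->registerXPathNamespace('dc', 'urn:example:dc');

$matches = $doc->xpath("/rss:foo[starts-with(dc:bar, '2004-01')]/rss:baz");
foreach ($matches as $baz) {
    echo $baz, "\n";
}
```

Note that the predicate tests the dc:bar child itself; putting the name 
in quotes would compare a literal string instead.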

In my ideal world, you can use SimpleXML for all XML documents, 
regardless of complexity (read: namespaces, right?). However, if this 
led to an unnecessary amount of complexity, I would sacrifice this 
point.

Also, because the developer is assumed to know the document's schema in 
advance, there's no need for an overwhelming set of introspection 
functions; that's where DOM excels.

To sum up: it would be helpful to see some *real world* XML documents 
that people want to parse using SimpleXML. We'd then try very hard to 
make sure SimpleXML was easy to use for those documents. It's easy to 
make up theoretical XML documents that are well-formed and 
pathologically nasty, but I'd much prefer to leave those to DOM.

> Attributes are handled associative arrays, so given an element with 2
> attributes with the same name, but in different namespaces, it wont 
> work:
> <foo a:bar="x" b:bar="y">
>
> xpath wont help here either as xsearch returns an array of sxe objects
> with the attribute nodes (which causes some additional problems).
> Its fine if this would have to be handled in dom, but to me the
> question really has never been fully answered.
> See also example under the xpath comments for elements containing
> mixed text and element nodes.

Ugh. That's nasty. I would prefer to not handle this in SimpleXML. Have 
you really even seen a case where someone did this?
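
For what it's worth, DOM can tell the two attributes apart by namespace 
URI. A minimal sketch, assuming made-up namespace URIs (your fragment 
doesn't declare any, and it needs them to be well-formed):

```php
<?php
// The a: and b: prefixes must be bound to namespaces; these URIs are
// hypothetical placeholders.
$dom = new DOMDocument();
$dom->loadXML(
    '<foo xmlns:a="urn:example:a" xmlns:b="urn:example:b" a:bar="x" b:bar="y"/>'
);

$foo = $dom->documentElement;

// Look each attribute up by (namespace URI, local name).
$aBar = $foo->getAttributeNS('urn:example:a', 'bar');
$bBar = $foo->getAttributeNS('urn:example:b', 'bar');

echo 'a:bar = ', $aBar, ', b:bar = ', $bBar, "\n";
```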

>>    When deciding the behavior of these functions (e.g. Does
>>    getChildren() return just the direct descendents or all children
>>    regardless of depth?), we'll define them to mimic XPath's behavior:
>>    (e.g. /child::node()). This reduces the potential for disagreement
>>    over what is the "correct" way to do things. (I'm just looking for
>>    a way to prevent protracted discussions over issues that have no
>>    clear "right" answers and can never really be solved.)
>
> Should only be direct descendants. One should be able to navigate the
> entire tree (elements/attributes) in a standard way without having to
> use xpath.
> imho, this is one of the biggest reason why the two functions should be
> implemented.

I agree here.

>> 4) XPath and validation functions will be available in SimpleXML, but
>>    we will not try to code generic extensions that work with both
>>    SimpleXML and DOM if for no other reason than this is not
>>    guaranteed to be simple. (e.g. SimpleXML must remove from XPath
>>    results nodes that aren't elements, attributes, and text nodes.)
>
> return types need to be standardized. attributes or getAttributes
> returns name/value array, while the current xsearch will return array
> of a sxe objects of the attribute node (which stated before is bad in
> the current state of simplexml).

I also agree here. This is one of the reasons I feel it's important to 
hash out these details now, so that all the functions work 
consistently.

I would prefer to always return an array (or a SimpleXML_List object 
that's similar to DOM nodeList) of SimpleXML objects from any querying 
function, whether it's getChildren(), getAttributes(), or xPathQuery(). 
I think this is most consistent.

For example:

<foo>
	<bar>a</bar>
	<bar>a</bar>
</foo>

With this document, $xml->xPathQuery('/foo/bar') and 
$foo->getChildren() (and maybe $foo?) would be equivalent.
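
A rough sketch of that equivalence. The method names proposed above are 
not final, so this assumes the names SimpleXML uses in current PHP 
(xpath() and children()) as stand-ins for xPathQuery() and 
getChildren():

```php
<?php
$foo = simplexml_load_string('<foo><bar>a</bar><bar>a</bar></foo>');

// Reduce a node list to "name=text" strings so two lists can be compared.
// $summarize is a helper invented for this example.
$summarize = function ($nodes) {
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $node->getName() . '=' . $node;
    }
    return $out;
};

$viaXPath    = $summarize($foo->xpath('/foo/bar')); // array of elements
$viaChildren = $summarize($foo->children());        // iterable of elements

echo ($viaXPath === $viaChildren) ? "same\n" : "different\n";
```

The two calls hand back different container types, which is the 
inconsistency under discussion; the elements they contain are the same.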

> Also, consider the following (an element contains a mix of text and
> element nodes):
> $foo = simplexml_load_string('<foo>ab<foo2>test</foo2>cd</foo>');
> $ns = $foo->xsearch('child::text()');
> foreach ($ns as $node) {
>  print "Node Value: ".$node."\n";
> }
>
> Output:
> Node Value: abcd
> Node Value: abcd
>
> One would expect:
> Node Value: ab
> Node Value: cd
>
> Is the output correct, should something like this not be handled via
> simpleXML, or is the xsearch incorrect when it returns the parent of a
> text node?

Honestly, I don't think anyone (read: I) ever considered that 
SimpleXML would be used in cases that mix text and element nodes. I 
never ran into this in the use cases I listed above.

Currently, I believe the SimpleXML document model assumes that an 
element contains (zero or more elements) or (one text node). So, if you 
take your XML example from above and do:

print $foo;

You get:

abcd

There's no way to access "ab" and "cd" as separate entities, so I would 
almost say the consistent answer is to concatenate the two text nodes 
from the XPath query and return just one text node, "abcd".

If you're looking for boundaries, I would tell you "Don't use SimpleXML 
for this because it may not act as you expect."
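
To make the boundary concrete, here is a minimal sketch using your 
document. The DOM half assumes the DOMDocument and DOMXPath classes as 
they exist in current PHP:

```php
<?php
$xml = '<foo>ab<foo2>test</foo2>cd</foo>';

// SimpleXML: casting the element to a string joins its direct text nodes.
$sxe = simplexml_load_string($xml);
$joined = (string) $sxe;

// DOM: each text node is a separate node, so the boundary survives.
$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$pieces = [];
foreach ($xpath->query('/foo/text()') as $text) {
    $pieces[] = $text->nodeValue;
}

echo $joined, "\n";               // abcd
echo implode(' | ', $pieces), "\n"; // ab | cd
```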

> Your initial point concerning what SimpleXML is was a good start, but
> it still doesn't define the boundaries of what it is meant to handle.
> When do you tell someone that what they are doing should not be done in
> SimpleXML?
> This is where I get lost with the API as I don't really know its
> intended limitations.

Does this bring us any closer to defining the boundaries? Would you 
like them shifted? :)

-adam

-- 
adam trachtenberg
adam@trachtenberg.com