SimpleXML: Moving Forward

21 years ago by Adam Maccabee Trachtenberg

In the hopes of moving the discussion forward, I'm going to try to
sum up the general consensus. I believe we agree on most issues, so
hopefully it should be easy to come up with the next steps we should
take with SimpleXML.

I know it's late in the PHP 5 process, but I feel that SimpleXML was
designed in somewhat of an ad-hoc manner and has only now reached the
point where we have enough experience using it to really know where
the troublesome issues lurk. (Build one to throw away.)

Since there hasn't yet been a final release of SimpleXML, this is our
last chance to make these changes without worrying about BC and it's
worth a little trouble to get things closer to right earlier rather
than later.

Here's where I think we stand, with the points descending from top to
bottom in order of general agreement. (e.g. We all agree on point #1
and not everyone agrees with point #6.)

1. SimpleXML creates PHP data structures from XML documents. It only
   handles XML elements, attributes, and text nodes. The syntax for
   accessing the text node children of an element is akin to object
   properties ($foo->bar); the syntax for accessing attributes is akin
   to array elements ($foo['bar']).

2. It is important to honor the Simple in SimpleXML. SimpleXML will
   not try to replicate the entire DOM using an alternative
   syntax. Instead, it will implement a "reasonable" (as of now
   deliberately vaguely defined) subset of XML manipulation
   functions. People who want more comprehensive features can use DOM.

3. This subset will attempt to be as minimalistic as possible, to keep
   the core small. (e.g. Use XPath.) However, common actions will have
   an alternative interface when they make SimpleXML easier to
   use. (e.g. Implement getChildren() and getAttributes() functions.)

   When deciding the behavior of these functions (e.g. Does
   getChildren() return just the direct descendants or all children
   regardless of depth?), we'll define them to mimic XPath's behavior
   (e.g. /child::node()). This reduces the potential for disagreement
   over what is the "correct" way to do things. (I'm just looking for
   a way to prevent protracted discussions over issues that have no
   clear "right" answers and can never really be solved.)

4. XPath and validation functions will be available in SimpleXML, but
   we will not try to code generic extensions that work with both
   SimpleXML and DOM, if for no other reason than this is not
   guaranteed to be simple. (e.g. SimpleXML must remove from XPath
   results nodes that aren't elements, attributes, and text nodes.)

5. There will be a procedural interface for SimpleXML because nobody
   is actively against this, Sterling is strongly for this, and
   Sterling is willing to code it. :)

6. There will be an OO interface for SimpleXML because everybody
   except Sterling is for this and because it already exists. :)
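
As a minimal sketch of the access syntax described in point #1, assuming a made-up document whose child element and attribute are both named bar (the XML is illustrative only):

```php
<?php
// Illustrative document; the element and attribute names are made up.
$foo = simplexml_load_string('<foo bar="attribute value"><bar>element text</bar></foo>');

// Child elements are reached like object properties.
echo $foo->bar, "\n";     // element text

// Attributes are reached like array elements.
echo $foo['bar'], "\n";   // attribute value
```

Both expressions yield SimpleXML objects rather than plain strings, so a (string) cast is needed when comparing values.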

Okay, this message is too long, so I will end it now.

If we can get a couple +1s on this, I would then like to move onto the
next steps which would be deciding which functions will be in the
initial release, what their prototypes are, and who should implement
what and by when.

If you have problems, please rebut in line. Please try to be brief,
I'm wordy enough for everyone. :)

-adam

--
adam@trachtenberg.com
author of o'reilly's php cookbook
avoid the holiday rush, buy your copy today!

21 years ago by Andi Gutmans

I pretty much agree with most of this. I am +1 on moving forward and
reaching a plan and implementation ASAP.
I agree with Sterling that we should keep things simple and slim. This will
also allow us to regret this decision in the future and add more
functionality. As I said previously, I would however want to see a few
methods which are still simple and extremely useful, so that in the common
case people won't need to convert the SimpleXML object to DOM.
But seriously, let's move fast on this. It's not right to delay RC1 if
people aren't giving this their best shot.

Andi

At 02:52 AM 1/13/2004 -0500, Adam Maccabee Trachtenberg wrote:

[snip: full quote of the original message]

21 years ago by Rob Richards

From: Adam Maccabee Trachtenberg

SimpleXML creates PHP data structures from XML documents. It only
handles XML elements, attributes, and text nodes. The syntax for
accessing the text node children of an element is akin to object
properties ($foo->bar); the syntax of accessing attributes is akin
to array elements ($foo['bar']).

This goes back to my question: what is the goal of SimpleXML?
Is it supposed to be an easy API for accessing any XML document, or
only simple ones?
Attributes are handled as associative arrays, so given an element with 2
attributes with the same name, but in different namespaces, it won't work:
<foo a:bar="x" b:bar="y">

XPath won't help here either, as xsearch returns an array of sxe objects
wrapping the attribute nodes (which causes some additional problems).
It's fine if this has to be handled in DOM, but to me the question
has never really been fully answered.
See also the example under the XPath comments for elements containing mixed
text and element nodes.
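
For reference, one way to read same-named attributes from different namespaces in SimpleXML as it exists in current PHP releases is to pass a namespace URI to attributes(). This is a sketch, not what the code under discussion did at the time. The namespace URIs are placeholders, and the prefixes have to be declared for the document to parse cleanly.

```php
<?php
// The namespace URIs are placeholders chosen for this sketch.
$xml = '<foo xmlns:a="http://example.com/a" xmlns:b="http://example.com/b"
             a:bar="x" b:bar="y"/>';
$foo = simplexml_load_string($xml);

// attributes() accepts a namespace URI and returns only the attributes in it.
$inA = $foo->attributes('http://example.com/a');
$inB = $foo->attributes('http://example.com/b');

echo $inA['bar'], "\n";   // x
echo $inB['bar'], "\n";   // y
```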

When deciding the behavior of these functions (e.g. Does
getChildren() return just the direct descendents or all children
regardless of depth?), we'll define them to mimic XPath's behavior:
(e.g. /child::node()). This reduces the potential for disagreement
over what is the "correct" way to do things. (I'm just looking for
a way to prevent protracted discussions over issues that have no
clear "right" answers and can never really be solved.)

Should only be direct descendants. One should be able to navigate the entire
tree (elements/attributes) in a standard way without having to use xpath.
imho, this is one of the biggest reasons why the two functions should be
implemented.

XPath and validation functions will be available in SimpleXML, but
we will not try to code generic extensions that work with both
SimpleXML and DOM if for no other reason than this is not
guaranteed to be simple. (e.g. SimpleXML must remove from XPath
results nodes that aren't elements, attributes, and text nodes.)

Return types need to be standardized. attributes or getAttributes returns a
name/value array, while the current xsearch will return an array of sxe
objects wrapping the attribute nodes (which, as stated before, is bad in the
current state of simplexml).

Also, consider the following (an element contains a mix of text and element
nodes):
$foo = simplexml_load_string('<foo>ab<foo2>test</foo2>cd</foo>');
$ns = $foo->xsearch('child::text()');
foreach ($ns as $node) {
    print "Node Value: ".$node."\n";
}

Output:
Node Value: abcd
Node Value: abcd

One would expect:
Node Value: ab
Node Value: cd

Is the output correct, should something like this not be handled via
simpleXML, or is the xsearch incorrect when it returns the parent of a text
node?

Your initial point concerning what SimpleXML is was a good start, but it
still doesn't define the boundaries of what it is meant to handle. When do
you tell someone that what they are doing should not be done in SimpleXML?
This is where I get lost with the API as I don't really know its intended
limitations.

Still +1 on the getChildren/getAttributes.

Rob

21 years ago by Christian Schneider

Rob Richards wrote:

From: Adam Maccabee Trachtenberg

accessing the text node children of an element is akin to object
properties ($foo->bar); the syntax of accessing attributes is akin
to array elements ($foo['bar']).

Hmm... This is somewhat upside-down, language-wise. Attributes are
properties of a node and the node contains children like an array
contains elements. But I guess we'll keep it that way for nesting's sake
($foo->foo2->foo3 is so much nicer than $foo['foo2']['foo3']).

But let's take a look at how I'd use it (xml formatted for readability):
$foo = simplexml_load_string('
<foo x:a="xa" y:a="ya">
ab
<foo2>foo2a</foo2>
cd
<foo2>foo2b</foo2>
ef
<foo3>
foo3
<foo4>foo4</foo4>
foo3
</foo3>
gh
</foo>');

foreach ($foo as $node) => foo2a foo2b foo3
foreach ($foo->foo2 as $node) => foo2a foo2b
foreach ($foo->foo3 as $node) => foo4
foreach ((array)$foo->foo3 as $node) => foo4
foreach ($foo->foo3->foo4 as $node) => nothing
foreach ((array)$foo->foo3->foo4 as $node) => foo4

What seems wrong here is that to output nodes where there can be 0 to
multiple instances I have to do something like:
if ($foo->$nodename)
{
    if (is_array($foo->$nodename))
    {
        foreach ($foo->$nodename as $node)
            echo "$node\n";
    }
    else
        echo "{$foo->$nodename}\n";
}
else
    echo "No node $nodename found\n";

$nodename = 'node1' => No node node1 found
$nodename = 'node2' => foo2a foo2b
$nodename = 'node3' => foo3

Attributes are handled as associative arrays, so given an element with 2
attributes with the same name, but in different namespaces, it won't work:
<foo a:bar="x" b:bar="y">

Right now $foo['bar'] will be an array('x', 'y') in that case. We're
losing the namespaces here but get the values. Simple or broken? Not sure.

Should only be direct descendants. One should be able to navigate the entire
tree (elements/attributes) in a standard way without having to use xpath.

I agree. What about getChildren($level = 1) with $level=0 meaning all?
This offers both functionalities while having a default we decide on.

As right now there is no easy (read non-xpath/xquery) way of getting the
attributes hidden in the magic array of $foo I think getAttributes
should be added too.

No other functions though. Should these be methods? I think so.

$foo = simplexml_load_string('<foo>ab<foo2>test</foo2>cd</foo>');
$ns = $foo->xsearch('child::text()');
foreach ($ns as $node)
    print "Node Value: ".$node."\n";

I would actually expect abcd but only once:
Node Value: abcd

Concatenating all text parts and returning them once for each part
definitely seems wrong.

+1 on getChildren/getAttributes (function or method)
-1 on more functions

I think it's quite usable this way and simple enough to use to earn the
name SimpleXML.

Chris

21 years ago by Adam Trachtenberg

But let's take a look on how I'd use it (xml formatted for
readability):
$foo = simplexml_load_string('
<foo x:a="xa" y:a="ya">
ab
<foo2>foo2a</foo2>
cd
<foo2>foo2b</foo2>
ef
<foo3>
foo3
<foo4>foo4</foo4>
foo3
</foo3>
gh
</foo>');

Ugh. This is pretty much the limit of what I think is reasonable for
SimpleXML to handle. I think the API would be more consistent if the
document looked like:

<foo x:a="xa" y:b="yb">
  <foo2>foo2a</foo2>
  <foo2>foo2b</foo2>
  <foo3>
    <foo4>foo4</foo4>
  </foo3>
</foo>

However, that may be placing too many restrictions upon documents to
make SimpleXML useful. Like I said before, I've never tried to use
SimpleXML with text nodes and elements sharing the same parent.

foreach ($foo as $node) => foo2a foo2b foo3
foreach ($foo->foo2 as $node) => foo2a foo2b
foreach ($foo->foo3 as $node) => foo4
foreach ((array)$foo->foo3 as $node) => foo4
foreach ($foo->foo3->foo4 as $node) => nothing
foreach ((array)$foo->foo3->foo4 as $node) => foo4

What seems wrong here is that to output nodes where there can be 0 to
multiple instances I have to do something like:
if ($foo->$nodename)
{
    if (is_array($foo->$nodename))
    {
        foreach ($foo->$nodename as $node)
            echo "$node\n";
    }
    else
        echo "{$foo->$nodename}\n";
}
else
    echo "No node $nodename found\n";

$nodename = 'node1' => No node node1 found
$nodename = 'node2' => foo2a foo2b
$nodename = 'node3' => foo3

I raised this as an issue yesterday. Sterling said he'd look into this.
However, to tie this into my reply to Rob, I think there's some
expectation that the developer knows what she's getting and that the
cases where you have 0, 1, or many potential elements are few. (That
said, I just developed something where I do essentially this all over
the place and it sucks.)

Here are my thoughts on solutions:

Place all elements in an array (or nodeList) regardless whether
there's 0, 1, or many. This is the DOM solution. This just leads to
annoying code where you need to do $foo->item(0) and $foo->firstChild.

However, I don't really see any way around this otherwise. Either it's
general or not. It can't be both. (Unless there's some magical type
that's both an array and a scalar.) I'm willing to put up with this
headache because the clunkiness here is outweighed by the niceness for
most cases.

If a document has an XML Schema (or RelaxNG schema), SimpleXML could
optionally inspect the schema to see if there are minOccurs and
maxOccurs attributes in the schema for an element. If maxOccurs > 1,
then the elements would be placed in an array even if there was only
one element in that particular instance.

This allows us to solve the problem by making the user specifically
tell us how they want SimpleXML to handle a document. It does add some
overhead, but simplicity is often more complex behind the scenes. This
has the benefit of using an existing XML technology to solve the
problem, but I don't know how expensive it'd be.

Again, my opinion is that arbitrary XML documents are best parsed using
DOM and well-defined ones are best parsed using SimpleXML.

Attributes are handled as associative arrays, so given an element with 2
attributes with the same name, but in different namespaces, it won't
work:
<foo a:bar="x" b:bar="y">

Right now $foo['bar'] will be an array('x', 'y') in that case. We're
losing the namespaces here but get the values. Simple or broken? Not
sure.

This case still makes me puke. :)

Right now, SimpleXML always makes you lose the namespaces unless you
use XPath. I don't think it's too much to ask that if you can handle
XML Namespaces you can also handle XPath. I would prefer to guide
people through XPath in these nasty cases than make the general API
handle them.

As right now there is no easy (read non-xpath/xquery) way of getting
the attributes hidden in the magic array of $foo I think
getAttributes should be added too.

AFAIK, it's actually also impossible to find out the name of the
document element using SimpleXML, even using XPath.

I ended up doing:

$xml = simplexml_load_string($data);
$type = dom_import_simplexml($xml)->tagName;

Without this feature, it's difficult to make SimpleXML work in cases
where a page could be potentially processing two different XML
documents because you can't inspect the XML document to figure out what
type it is. :)
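
A short sketch of that workaround, next to the getName() method that later PHP 5 releases added to SimpleXMLElement (it was not available when this message was written). The sample document is made up.

```php
<?php
$data = '<invoice><total>10</total></invoice>'; // made-up sample document
$xml = simplexml_load_string($data);

// The workaround from the message above: go through DOM.
$type = dom_import_simplexml($xml)->tagName;

// getName() was added to SimpleXMLElement in a later PHP 5 release.
$name = $xml->getName();

echo $type, "\n";   // invoice
echo $name, "\n";   // invoice
```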

No other functions though. Should these be methods? I think so.

$foo = simplexml_load_string('<foo>ab<foo2>test</foo2>cd</foo>');
$ns = $foo->xsearch('child::text()');
foreach ($ns as $node)
    print "Node Value: ".$node."\n";

I would actually expect abcd but only once:
Node Value: abcd

Concatenating all text parts and returning them once for each part
definitely seems wrong.

Aren't those two lines contradictory? :)

+1 on getChildren/getAttributes (function or method)
-1 on more functions

I think it's quite usable this way and simple enough to use to earn
the name SimpleXML.

I think this is where we're coming out. (Modulo the XPath and
Validation functions.)

-adam

--
adam trachtenberg
adam@trachtenberg.com

21 years ago by Adam Maccabee Trachtenberg

Here are my thoughts on solutions:

Place all elements in an array (or nodeList) regardless whether
there's 0, 1, or many. This is the DOM solution. This just leads to
annoying code where you need to do $foo->item(0) and $foo->firstChild.

However, I don't really see any way around this otherwise. Either it's
general or not. It can't be both. (Unless there's some magical type
that's both an array and a scalar.) I'm willing to put up with this
headache because the clunkiness here is outweighed by the niceness for
most cases.

To clarify, I prefer NOT to put everything in an array like DOM. The
headache I'm willing to put up with is the if(is_array()) code,
especially if it turns out to be feasible to do the XML Schema linkage
I suggested.

-adam

--
adam@trachtenberg.com

21 years ago by Christian Schneider

Adam Trachtenberg wrote:

Ugh. This is pretty much the limit of what I think is reasonable for
SimpleXML to handle. I think the API would be more consistent if the

Agreed. I was just curious how it behaves if I push it to the limit :-)

However, I don't really see any way around this otherwise. Either it's
general or not. It can't be both. (Unless there's some magical type
that's both an array and a scalar.) I'm willing to put up with this

Elements are already a magical type which is an object and an array.
Making foreach work on both the scalar and array incantation of child
elements seems very useful to me. And usefulness seems to be the goal of
SimpleXML as far as I understand.

maxOccurs attributes in the schema for an element. If maxOccurs > 1,
then the elements would be placed in an array even if there was only one
element in that particular instance.

I don't like the idea: Different behaviour with or without schema. I
write code without schema first. If I decide to add a schema later code
has to be rewritten. Not good.

Again, my opinion is that arbitrary XML documents are best parsed using
DOM and well-defined ones are best parsed using SimpleXML.

I agree, but I think I'll quite often have something like
<config>
  <option>1</option>
  <option>2</option>
  ...
</config>
with 0 to many options, and iterating should be a simple foreach IMHO.
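
For comparison, a sketch of how this case behaves in SimpleXML as shipped in current PHP releases, where the same foreach covers zero, one, or many child elements. The three config documents are made up for illustration.

```php
<?php
// Three made-up configs with zero, one, and two <option> elements.
$documents = [
    '<config/>',
    '<config><option>1</option></config>',
    '<config><option>1</option><option>2</option></config>',
];

foreach ($documents as $data) {
    $config = simplexml_load_string($data);
    $values = [];
    // The same loop handles all three cases; no is_array() check is needed.
    foreach ($config->option as $option) {
        $values[] = (string) $option;
    }
    echo count($values), ': ', implode(',', $values), "\n";
}
```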

AFAIK, it's actually also impossible to find out the name of the
document element using SimpleXML, even using XPath.

Yup, right now the only way is probably to wrap it in a dummy tag before
giving it to SimpleXML. Sounds like a good idea anyway to me, if I have
a domain specific document with varying content I'd probably do
<domain>...</domain> anyway.

I would actually expect abcd but only once:
Node Value: abcd

Concatenating all text parts and returning them once for each part
definitely seems wrong.

Aren't those two lines contradictory? :)

Why? Right now it returns abcd twice which is definitely wrong.
Returning ab and cd or (preferably IMHO) abcd once seems right.

Chris

21 years ago by Adam Maccabee Trachtenberg

Adam Trachtenberg wrote:

However, I don't really see any way around this otherwise. Either it's
general or not. It can't be both. (Unless there's some magical type
that's both an array and a scalar.) I'm willing to put up with this

Elements are already a magical type which is an object and an array.
Making foreach work on both the scalar and array incantation of child
elements seems very useful to me. And usefulness seems to be the goal of
SimpleXML as far as I understand.

If that can be done, then I am all for it. Maybe we can somehow make
an individual item Iterable.

maxOccurs attributes in the schema for an element. If maxOccurs > 1,
then the elements would be placed in an array even if there was only one
element in that particular instance.

I don't like the idea: Different behaviour with or without schema. I
write code without schema first. If I decide to add a schema later code
has to be rewritten. Not good.

Don't do that. Use the schema. :)

AFAIK, it's actually also impossible to find out the name of the
document element using SimpleXML, even using XPath.

Yup, right now the only way is probably to wrap it in a dummy tag before
giving it to SimpleXML. Sounds like a good idea anyway to me, if I have
a domain specific document with varying content I'd probably do
<domain>...</domain> anyway.

Unless, of course, you're not in control over those documents.

I would actually expect abcd but only once:
Node Value: abcd

Concatenating all text parts and returning them once for each part
definitely seems wrong.

Aren't those two lines contradictory? :)

Why? Right now it returns abcd twice which is definitely wrong.
Returning ab and cd or (preferably IMHO) abcd once seems right.

Sorry. I misread. I thought you wrote "returning them once definitely
seems wrong." I agree that we should return "abcd" in this case.

-adam

--
adam@trachtenberg.com

21 years ago by Christian Schneider

Adam Maccabee Trachtenberg wrote:

Don't do that. Use the schema. :)

Is that why it's called SimpleXML? ;-)
I don't think one should force people to use schemata for now.

Unless, of course, you're not in control over those documents.

You can simply wrap it when passing it to SimpleXML, e.g.
simplexml_load_string("<mydummytag>$xmldata</mydummytag>"), no?

Chris

21 years ago by Adam Trachtenberg

From: Adam Maccabee Trachtenberg

SimpleXML creates PHP data structures from XML documents. It only
handles XML elements, attributes, and text nodes. The syntax for
accessing the text node children of an element is akin to object
properties ($foo->bar); the syntax of accessing attributes is akin
to array elements ($foo['bar']).

This goes back to my question on what is the goal of SimpleXML?
Is it supposed to be an easy api to be able to access any xml document
or only not complex ones?

Here's where I see the benefit of SimpleXML. SimpleXML should be used
when you know the schema of an XML document and want to extract
specific pieces of data from it. My favorite use-cases are: RSS, REST,
and configuration files.

This doesn't mean there's necessarily a formal XML Schema or RelaxNG
document, but that the developer is familiar enough with the layout of
the XML document that she knows what she's looking for and can
formulate code to access the information she wants.

In most cases, this will be through directly accessing text nodes
through $foo->bar->baz. More complex cases will be handled using XPath:
/rss:foo[starts-with(dc:bar, '2004-01')]/rss:baz.
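
A sketch of what such a query can look like with the xpath() and registerXPathNamespace() methods found in current PHP releases. The feed, its element names, and the rss namespace URI are invented for illustration; only the Dublin Core URI is a real namespace.

```php
<?php
// Made-up feed; the rss namespace URI is a placeholder, dc is Dublin Core.
$data = <<<XML
<feed xmlns="http://example.com/rss" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item><dc:date>2004-01-13</dc:date><title>SimpleXML: Moving Forward</title></item>
  <item><dc:date>2003-12-01</dc:date><title>Older entry</title></item>
</feed>
XML;
$xml = simplexml_load_string($data);

// Prefixes used in the expression are registered before querying.
$xml->registerXPathNamespace('rss', 'http://example.com/rss');
$xml->registerXPathNamespace('dc', 'http://purl.org/dc/elements/1.1/');

// starts-with() is the XPath 1.0 prefix test.
$titles = $xml->xpath("/rss:feed/rss:item[starts-with(dc:date, '2004-01')]/rss:title");

foreach ($titles as $title) {
    echo $title, "\n";   // SimpleXML: Moving Forward
}
```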

In my ideal world, you can use SimpleXML for all XML documents,
regardless of complexity (read: namespaces, right?). However, if this
led to an unnecessary amount of complexity, I would sacrifice this
point.

Also, since there's some assumption of developer fore-knowledge of the
document's schema, there's no need for an overwhelming set of
introspection functions, since that's where DOM excels.

To sum up: it would be helpful to see some real world XML documents
that people want to parse using SimpleXML.
We'd then try very hard to make sure SimpleXML was easy to use for
those documents. It's easy to make up theoretical XML documents that
are well-formed and pathologically nasty, but I'd much prefer to leave
those to DOM.

Attributes are handled as associative arrays, so given an element with 2
attributes with the same name, but in different namespaces, it won't
work:
<foo a:bar="x" b:bar="y">

xpath won't help here either as xsearch returns an array of sxe objects
with the attribute nodes (which causes some additional problems).
It's fine if this would have to be handled in dom, but to me the
question really has never been fully answered.
See also example under the xpath comments for elements containing
mixed text and element nodes.

Ugh. That's nasty. I would prefer to not handle this in SimpleXML. Have
you really ever seen a case where someone did this?

When deciding the behavior of these functions (e.g. Does
getChildren() return just the direct descendents or all children
regardless of depth?), we'll define them to mimic XPath's behavior:
(e.g. /child::node()). This reduces the potential for disagreement
over what is the "correct" way to do things. (I'm just looking for
a way to prevent protracted discussions over issues that have no
clear "right" answers and can never really be solved.)

Should only be direct descendants. One should be able to navigate the
entire tree (elements/attributes) in a standard way without having to
use xpath.
imho, this is one of the biggest reasons why the two functions should be
implemented.

I agree here.

XPath and validation functions will be available in SimpleXML, but
we will not try to code generic extensions that work with both
SimpleXML and DOM if for no other reason than this is not
guaranteed to be simple. (e.g. SimpleXML must remove from XPath
results nodes that aren't elements, attributes, and text nodes.)

return types need to be standardized. attributes or getAttributes
returns a name/value array, while the current xsearch will return an
array of sxe objects of the attribute node (which, as stated before, is
bad in the current state of simplexml).

I also agree here. This is one of the reasons I feel it's important to
hash out these details now, so that all the functions work
consistently.

I would prefer to always return an array (or a SimpleXML_List object
that's similar to DOM nodeList) of SimpleXML objects from any querying
function, whether it's getChildren(), getAttributes, or xPathQuery(). I
think this is most consistent.

For example:

Therefore, $xml->xPathQuery('/foo/bar') and $foo->getChildren() (and
maybe $foo?) would be equivalent.

Also, consider the following (an element contains a mix of text and
element nodes):
$foo = simplexml_load_string('<foo>ab<foo2>test</foo2>cd</foo>');
$ns = $foo->xsearch('child::text()');
foreach ($ns as $node) {
    print "Node Value: ".$node."\n";
}

Output:
Node Value: abcd
Node Value: abcd

One would expect:
Node Value: ab
Node Value: cd

Is the output correct, should something like this not be handled via
simpleXML, or is the xsearch incorrect when it returns the parent of a
text node?

Honestly, I don't think anyone (read: I) ever considered that
SimpleXML would be used in cases that mix text and element nodes. I
never encountered this in my use-cases from above.

Currently, I believe the SimpleXML document model assumes that an
element contains (zero or more elements) or (one text node). So, if you
take your XML example from above and do:

print $foo

You get:

abcd

There's no way to access "ab" and "cd" as separate entities, so I would
almost say the consistent answer is to concatenate the two text nodes
from the XPath query and return just one text node, "abcd".

If you're looking for boundaries, I would tell you "Don't use SimpleXML
for this because it may not act as you expect."

Your initial point concerning what SimpleXML is was a good start, but
it still doesn't define the boundaries of what it is meant to handle.
When do you tell someone that what they are doing should not be done in
SimpleXML?
This is where I get lost with the API as I don't really know its
intended limitations.

Does this bring us any closer to defining the boundaries? Would you
like them shifted? :)

-adam

--
adam trachtenberg
adam@trachtenberg.com

21 years ago by Sterling Hughes

From: Adam Maccabee Trachtenberg

SimpleXML creates PHP data structures from XML documents. It only
handles XML elements, attributes, and text nodes. The syntax for
accessing the text node children of an element is akin to object
properties ($foo->bar); the syntax of accessing attributes is akin
to array elements ($foo['bar']).

This goes back to my question on what is the goal of SimpleXML?
Is it supposed to be an easy api to be able to access any xml document or
only not complex ones?
Attributes are handled as associative arrays, so given an element with 2
attributes with the same name, but in different namespaces, it won't work:
<foo a:bar="x" b:bar="y">

Attributes should be ns qualified within the array:

$node->foo['a:bar']

This would respect namespaces "registered" by register_ns()

xpath won't help here either as xsearch returns an array of sxe objects with
the attribute nodes (which causes some additional problems).
It's fine if this would have to be handled in dom, but to me the question
really has never been fully answered.
See also example under the xpath comments for elements containing mixed text
and element nodes.

XPath and validation functions will be available in SimpleXML, but
we will not try to code generic extensions that work with both
SimpleXML and DOM if for no other reason than this is not
guaranteed to be simple. (e.g. SimpleXML must remove from XPath
results nodes that aren't elements, attributes, and text nodes.)

I've decided (unless more people pipe up in support of removing
children() and attributes(); it's currently 2-3 against) to leave children()
and attributes(), but remove the other methods. Things like schema
validation and xpath queries will become procedural.

return types need to be standardized. attributes or getAttributes returns
name/value array, while the current xsearch will return array of a sxe
objects of the attribute node (which stated before is bad in the current
state of simplexml).

xsearch will become a procedure, simplexml_query($node, 'expression', $matches);

Also, consider the following (an element contains a mix of text and element
nodes):
$foo = simplexml_load_string('<foo>ab<foo2>test</foo2>cd</foo>');
$ns = $foo->xsearch('child::text()');
foreach ($ns as $node) {
    print "Node Value: ".$node."\n";
}

Output:
Node Value: abcd
Node Value: abcd

One would expect:
Node Value: ab
Node Value: cd

Is the output correct, should something like this not be handled via
simpleXML, or is the xsearch incorrect when it returns the parent of a text
node?

Yep, this is the intended behaviour of simplexml. The
simplexml_save_string() function will allow you to get the entire node
contents (including tags). As for processing text children separately,
use DOM. It can interpret the same results of an XPath query in the
manner you desire.
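
A minimal sketch of the DOM route suggested above, assuming a current PHP release with the dom extension loaded. It reuses the mixed-content document from the question and reads each text child separately through DOMXPath.

```php
<?php
$foo = simplexml_load_string('<foo>ab<foo2>test</foo2>cd</foo>');

// dom_import_simplexml() exposes the same element as a DOMElement.
$element = dom_import_simplexml($foo);
$xpath = new DOMXPath($element->ownerDocument);

// Collect the direct text children of <foo>, one entry per text node.
$values = [];
foreach ($xpath->query('child::text()', $element) as $textNode) {
    $values[] = $textNode->nodeValue;
}

foreach ($values as $value) {
    print "Node Value: " . $value . "\n";
}
// Node Value: ab
// Node Value: cd
```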

Your initial point concerning what SimpleXML is was a good start, but it
still doesn't define the boundaries of what it is meant to handle. When do
you tell someone that what they are doing should not be done in SimpleXML?
This is where I get lost with the API as I don't really know its intended
limitations.

Well, this is the purpose of finalizing the underlying API. The answer
to that question is simple: if it doesn't do what you want, then
SimpleXML is not what you want. This is part of the reason I want to
finalize on no methods. If you need methods, use DOM; the two are
totally interoperable, requiring zero document copies to work with both.
You can process a DOM object then load it into simplexml for the final
processing. Conversely you can take a simplexml object and load it into
DOM for complex processing.
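
A small sketch of that round trip, assuming a current PHP release with the dom extension loaded; the document and attribute names are made up for illustration. Both views refer to the same underlying node, so a change made through DOM shows up in SimpleXML without copying the document.

```php
<?php
// Hypothetical document used only for this illustration.
$sxe = simplexml_load_string('<order><item>tea</item></order>');

// View the same node through DOM and modify it there.
$dom = dom_import_simplexml($sxe);
$dom->setAttribute('status', 'open');
$dom->appendChild($dom->ownerDocument->createElement('item', 'milk'));

// The DOM changes are visible through the original SimpleXML object.
$status = (string) $sxe['status'];
$itemCount = count($sxe->item);

// And back again: wrap the DOM node as a SimpleXML object.
$again = simplexml_import_dom($dom);
$first = (string) $again->item[0];

print "$status, $itemCount items, first is $first\n"; // open, 2 items, first is tea
```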

I'm certainly stopping the API at children() and attributes(),
regardless, as anything else is just silly, and it seems that people
only felt strongly about these two functions. (*)

Schema validation and XPath searching will become functions in
SimpleXML space.

-Sterling

(*) Btw, getChildren() is currently broken from a userspace perspective,
as it is mainly implemented for the SPL recursive iterator. This will
have to change; SimpleXML will not add userspace APIs for other
extensions.

21 years ago by Zeev Suraski

At 14:33 13/01/2004, Rob Richards wrote:

From: Adam Maccabee Trachtenberg

SimpleXML creates PHP data structures from XML documents. It only
handles XML elements, attributes, and text nodes. The syntax for
accessing the text node children of an element is akin to object
properties ($foo->bar); the syntax of accessing attributes is akin
to array elements ($foo['bar']).

This goes back to my question: what is the goal of SimpleXML?
Is it supposed to be an easy API for accessing any XML document, or
only simple ones?

The way I see it - yes. Clarity and simplicity are much more important in
SimpleXML than being able to analyze every last XML document; that's at
least my opinion, and this is what caught the attention of people just
about anywhere I demo'd it.
Please note that SimpleXML takes after BEA's implementation of XML support
in ECMAScript. It may not be a bad idea to see what decisions they took in
that context.

Zeev

21 years ago by Andrei Zmievski

If we can get a couple +1s on this, I would then like to move onto the
next steps which would be deciding which functions will be in the
initial release, what their prototypes are, and who should implement
what and by when.

+1.

-Andrei

"Aim for success, not perfection. Never give up your right to be wrong,
because then you will lose the ability to learn new things and move
forward with your life. Remember that fear always lurks behind
perfectionism. Confronting your fears and allowing yourself the right to
be human can, paradoxically, make yourself a happier and more productive
person."
-- Dr. David M. Burns

21 years ago by Marcus Boerger

Hello Adam,

Thanks for moving backward. Since iterating is now worthless i am all for
removing it completely. I mean it isn't even in the spirit of the extension.
I will sleep on this tonight and probably remove the work of another full
week too. Just because it is too complex and doesn't fit in the current
scheme of SimpleXML.

Tuesday, January 13, 2004, 8:52:37 AM, you wrote:

In the hopes of moving the discussion forward, I'm going to try and
sum up general consensus. I believe we agree on most issues, so
hopefully it should be easy to come up with the next steps we should
take with SimpleXML.

I know it's late in the PHP 5 process, but I feel that SimpleXML was
designed in somewhat of an ad-hoc manner and has only now reached the
point where we have enough experience using it to really know where
the troublesome issues lurk. (Build one to throw away.)

Since there hasn't yet been a final release of SimpleXML, this is our
last chance to make these changes without worrying about BC and it's
worth a little trouble to get things closer to right earlier rather
than later.

Here's where I think we stand, with the points descending from top to
bottom in order of general agreement. (e.g. We all agree on point #1
and not everyone agrees with point #6.)

SimpleXML creates PHP data structures from XML documents. It only
handles XML elements, attributes, and text nodes. The syntax for
accessing the text node children of an element is akin to object
properties ($foo->bar); the syntax of accessing attributes is akin
to array elements ($foo['bar']).

It is important to honor the Simple in SimpleXML. SimpleXML will
not try to replicate the entire DOM using an alternative
syntax. Instead, it will implement a "reasonable" (as of now
deliberately vaguely defined) subset of XML manipulation
functions. People who want more comprehensive features can use DOM.

This subset will attempt to be as minimalistic as possible, to keep
the core small. (e.g. Use XPath.) However, common actions will have
an alternative interface when they make SimpleXML easier to
use. (e.g. Implement getChildren() and getAttributes() functions.)

When deciding the behavior of these functions (e.g. Does
getChildren() return just the direct descendants or all children
regardless of depth?), we'll define them to mimic XPath's behavior:
(e.g. /child::node()). This reduces the potential for disagreement
over what is the "correct" way to do things. (I'm just looking for
a way to prevent protracted discussions over issues that have no
clear "right" answers and can never really be solved.)

XPath and validation functions will be available in SimpleXML, but
we will not try to code generic extensions that work with both
SimpleXML and DOM if for no other reason than this is not
guaranteed to be simple. (e.g. SimpleXML must remove from XPath
results nodes that aren't elements, attributes, and text nodes.)

There will be a procedural interface for SimpleXML because nobody
is actively against this, Sterling is strongly for this, and
Sterling is willing to code it. :)

There will be an OO interface for SimpleXML because everybody
except Sterling is for this and because it already exists. :)

Okay, this message is too long, so I will end it now.

If we can get a couple +1s on this, I would then like to move onto the
next steps which would be deciding which functions will be in the
initial release, what their prototypes are, and who should implement
what and by when.

If you have problems, please rebut in line. Please try to be brief,
I'm wordy enough for everyone. :)

-adam

--
adam@trachtenberg.com
author of o'reilly's php cookbook
avoid the holiday rush, buy your copy today!

--
Best regards,
Marcus mailto:helly@php.net

45 years ago by Sterling Hughes

Hello Adam,

Thanks for moving backward. Since iterating is now worthless i am all for
removing it completely. I mean it isn't even in the spirit of the extension.
I will sleep on this tonight and probably remove the work of another full
week too. Just because it is too complex and doesn't fit in the current
scheme of SimpleXML.

Marcus - why remove iterators? Just because we remove their user-level
APIs, I see no reason to remove the internal implementations.

-Sterling

--
"Reductionists like to take things apart. The rest of us are
just trying to get it together."
- Larry Wall, Programming Perl, 3rd Edition

21 years ago by Marcus Boerger

Hello Sterling,

Saturday, January 19, 1980, 12:10:39 PM, you wrote:

Hello Adam,

Thanks for moving backward. Since iterating is now worthless i am all for
removing it completely. I mean it isn't even in the spirit of the extension.
I will sleep on this tonight and probably remove the work of another full
week too. Just because it is too complex and doesn't fit in the current
scheme of SimpleXML.

Marcus - why remove iterators? Just because we remove their user-level
APIs, I see no reason to remove the internal implementations.

According to this week's discussion, people were in favor of not removing the
functions we have right now. That includes the iterator user functions as
well, which are needed for interaction between sxe and spl or for other user
space specializations of iteration.

Furthermore we (ppl not involved in developing sxe at all) discussed
reducing sxe to the level described in the readme, which i think they have not
read at all. But be it that way.

My plan now - since sxe is absolutely worthless for me - is to complete my
work to fully enable inheritance on sxe objects. That is, a derived sxe object
will create sxe objects of the same type when returning child elements as
objects. If anyone disagrees here i will mark the class as final and be
done. But speak now because i can waste my time with better things.

After that fix i will move all (userspace) iteration mechanisms to an
inherited class inside spl. Since an infrastructure change is needed for
that, spl finally has to wait for 5.1 or even 6, or (in that case) i need
to reinvent the wheel completely. Anyway this is what i will do. [This i
will do not only because i like iteration but because i did it for two
reasons. First being simplicity for several easy things and second for
optimization in several scenarios. Even though i still haven't proved the
latter.]

Coming back to the iteration within sxe. NO - i repeat - no php object
that offers array overloading is forced to behave like an array when it
comes to iteration, so i no longer see any reason why this should be the
case for sxe objects. If there is one, please tell me. [And furthermore no
object can be used in array functions even though i already showed it would
be possible for several object types in all but two array functions.]

And now back to comparing xpath with iteration. First question is why
don't we drop the xpath search method and implement this as array access,
which would look much better to my eyes, and then we could drop another
method. Then i also wonder why people think forcing other people to learn
xpath is simpler or easier than using iteration? But as this was an often
raised opinion in the last days i think we should drop iteration
completely (or implement it fully as we had before).

But maybe i am wrong and it is only the complex iteration strategies people
wanted to drop and with that they wanted to disallow others to implement
such things because it can be done with xpath expressions too.

I am sorry for the long mail, and it would have been better in the first
place if the main developers behind the extension had been asked
before a change and had agreed upon one.

--
Best regards,
Marcus mailto:helly@php.net

21 years ago by Rob Richards

Thanks for moving backward. Since iterating is now worthless i am all for
removing it completely. I mean it isn't even in the spirit of the extension.
I will sleep on this tonight and probably remove the work of another full
week too. Just because it is too complex and doesn't fit in the current
scheme of SimpleXML.

Marcus,

Ignore the user space issue for right now as I don't go into that at all. I
don't see iterators as being worthless; however, there is a behavior clash
between the iterators and the arrays.
Fetching a property returns an array for multiple elements and
a single sxe object for a single element (and the single sxe object
will iterate the children of the returned object), so a foreach will give
different results depending upon which is returned.
$a = $foo->a;
foreach($a->b as $b) {
    ...
}

imo, the behavior should be the same regardless of what is returned. This
pretty much means generalizing things to use either arrays or iterators.

Disclaimer: Not advocating the following, just that it may be a possible solution
depending upon the ultimate scope of sxe.

One nice thing is that if it were implemented as iterators, one could do:
$b = $a->b;
print $b;

as $b should be the first returned sxe object, which would have the
iterator created (how to handle no elements would need to be determined:
either NULL or some type of empty sxe object).
It could be thought of in terms of how the nodelist was
implemented in DOM, yet this is much scaled down and is implemented directly
on the sxe object. This would also allow for other methods (such as
children, attributes, count, etc., if any are to be implemented) to be
plugged right in, allowing for things like:
$foo->b->count(), $foo->a->children()->count(), $foo->a->attributes(),
etc...

However, to implement something like the above, as well as to implement iterators
in sxe at all, they are going to need to point to the node proxies rather
than to the libxml nodes themselves; otherwise it can be made to segfault (as
is the case today):
$foo = simplexml_load_string('<foo><a><b1>b1val</b1><b2>b2val</b2></a></foo>');
$a = $foo->a;
foreach($a as $b) {
    foreach($b as $subb) {
        print $subb."\n";
        unset($foo->a);
    }
}

Implementing these iterators wouldn't be as simple as they previously were,
but would offer a lot more functionality and flexibility (internally).

Rob

21 years ago by Adam Maccabee Trachtenberg

Ignore the user space issue for right now as I don't go into that at all. I
don't see iterators as being worthless; however, there is a behavior clash
between the iterators and the arrays.

I really like Iterators, so I'd like to see this ironed out. Based on
what I've seen from Marcus at ApacheCon you can do really cool stuff
with them and I don't think there's any reason to delete them if they
can be implemented consistently alongside the rest of the
extension. (Which I'm sure they can be.)

Fetching a property returns an array for multiple elements and
a single sxe object for a single element (and the single sxe object
will iterate the children of the returned object), so a foreach will give
different results depending upon which is returned.
$a = $foo->a;
foreach($a->b as $b) {
    ...
}

imo, the behavior should be the same regardless of what is returned. This
pretty much means generalizing things to use either arrays or iterators.

This is one of my problems because even if you know a document's
Schema, you can still run into instances where a node may have 0, 1,
or more subelements of the same name.

I know this has been discussed before. (I think I remember George
being involved, but I can't remember who else.)

The current solution is to do:

if (isset($foo->a)) {
    if (is_array($foo->a)) {
        foreach($foo->a as $a) {
            // blah with $a
        }
    } else {
        // blah with $foo->a
    }
}

It would be a "Real Good Thing" (tm), if that could be turned into this:

foreach($foo->a as $a) {
    // blah $a
}

If this can be done using Iterators, I see it as a strongly compelling
reason to use them. I think this is a clear-cut win in terms of
programming simplicity (and I should know since I've got over 2,000
lines of code that's either the little dance above or a variation of it).
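
As a sketch of the uniform loop wished for above: in current PHP releases, property access on a SimpleXML object yields something iterable whether there are zero, one, or many matching elements. The helper name collectA and the sample documents are hypothetical, and this reflects shipped behavior rather than the pre-release API being debated here.

```php
<?php
// Hypothetical helper: gather the text of every <a> child of the root element.
function collectA(string $xml): array
{
    $foo = simplexml_load_string($xml);
    $seen = [];
    // One loop covers the zero, one, and many cases alike.
    foreach ($foo->a as $a) {
        $seen[] = (string) $a;
    }
    return $seen;
}

print_r(collectA('<foo/>'));                   // no <a> elements
print_r(collectA('<foo><a>1</a></foo>'));      // exactly one
print_r(collectA('<foo><a>1</a><a>2</a></foo>')); // several
```

No isset() or is_array() check is needed before the loop in this form.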

However, to implement something like the above, as well as to implement iterators
in sxe at all, they are going to need to point to the node proxies rather
than to the libxml nodes themselves; otherwise it can be made to segfault (as
is the case today):
$foo = simplexml_load_string('<foo><a><b1>b1val</b1><b2>b2val</b2></a></foo>');
$a = $foo->a;
foreach($a as $b) {
    foreach($b as $subb) {
        print $subb."\n";
        unset($foo->a);
    }
}

Nasty crashes are bad. :)

-adam

--
adam@trachtenberg.com
author of o'reilly's php cookbook
avoid the holiday rush, buy your copy today!

21 years ago by Marcus Boerger

Hello Adam,

Thursday, January 15, 2004, 6:01:58 PM, you wrote:

Ignore the user space issue for right now as I don't go into that at all. I
don't see iterators as being worthless; however, there is a behavior clash
between the iterators and the arrays.

I really like Iterators, so I'd like to see this ironed out. Based on
what I've seen from Marcus at ApacheCon you can do really cool stuff
with them and I don't think there's any reason to delete them if they
can be implemented consistently alongside the rest of the
extension. (Which I'm sure they can be.)

Just a note: The thing you saw required SXE objects implementing
interface RecursiveIterator{reset(), hasMore(), key(), current(),
next(), hasChildren(), getChildren()}

Fetching a property returns an array for multiple elements and
a single sxe object for a single element (and the single sxe object
will iterate the children of the returned object), so a foreach will give
different results depending upon which is returned.
$a = $foo->a;
foreach($a->b as $b) {
    ...
}

imo, the behavior should be the same regardless of what is returned. This
pretty much means generalizing things to use either arrays or iterators.

This is one of my problems because even if you know a document's
Schema, you can still run into instances where a node may have 0, 1,
or more subelements of the same name.

That's exactly why dimity tried to implement $obj->node[0]. And i guess this
is vital to the extension.

I know this has been discussed before. (I think I remember George
being involved, but I can't remember who else.)

The current solution is to do:

if (isset($foo->a)) {
    if (is_array($foo->a)) {
        foreach($foo->a as $a) {
            // blah with $a
        }
    } else {
        // blah with $foo->a
    }
}

It would be a "Real Good Thing" (tm), if that could be turned into this:

foreach($foo->a as $a) {
    // blah $a
}

To achieve that the property read handler would have to return an array or
iterator class always. That's all that needs to be changed. Unfortunately
this would be a major change of functionality so feel free to drop property
handling completely.

If this can be done using Iterators, I see it as a strongly compelling
reason to use them. I think this is a clear-cut win in terms of
programming simplicity (and I should know since I've got over 2,000
lines of code that's either the little dance above or a variation of it).

However, to implement something like the above, as well as to implement iterators
in sxe at all, they are going to need to point to the node proxies rather
than to the libxml nodes themselves; otherwise it can be made to segfault (as
is the case today):
$foo = simplexml_load_string('<foo><a><b1>b1val</b1><b2>b2val</b2></a></foo>');
$a = $foo->a;
foreach($a as $b) {
    foreach($b as $subb) {
        print $subb."\n";
        unset($foo->a);
    }
}

I guess it can be done by simply fixing the refcounting. So why not start
some work/analysis instead of mucking around? The least you could do is
write some test files under the test subdirectory.

Nasty crashes are bad. :)

-adam

--
Best regards,
Marcus mailto:helly@php.net

21 years ago by Christian Schneider

Marcus Boerger wrote:

Just a note: The thing you saw required SXE objects implementing
interface RecursiveIterator{reset(), hasMore(), key(), current(),
next(), hasChildren(), getChildren()}

Excuse my ignorance: As I wasn't at ApacheCon I'm not sure what Adam was
talking about, is the SXE/RecursiveIterator stuff what makes foreach
possible?

That's exactly why dimity tried to implement $obj->node[0]. And i guess this
is vital to the extension.

Again: Is $obj->node[0] needed to provide foreach in all cases (0, 1 or
more elements)? Or is this another way of accessing the elements?

To achieve that the property read handler would have to return an array or
iterator class always. That's all that needs to be changed. Unfortunately
this would be a major change of functionality so feel free to drop property
handling completely.

Again I can't follow the conclusion you're drawing. I think people on the
list here expressed what they'd like to see as functionality (i.e. how
they'd use SimpleXML). If it needs a major change so be it.

I guess it can be done by simply fixing the refcounting. So why not start
some work/analysis instead of mucking around? The least you could do is
write some test files under the test subdirectory.

I think the problem right now is that no one feels responsible to go and
start coding. Old dilemma: If you start coding too early people get
upset because you didn't ask first. And if you want to nail down the
specs first you get accused of 'only mucking around'.

So who's responsible for SimpleXML now and what are the specs? I'd be
willing to provide some code if the person in charge wants help, I don't
have the knowledge or karma on this list to lead the development though,
that's for sure. So give me specs and I give you e.g. test cases :-)

Chris

21 years ago by Marcus Boerger

Hello Christian,

Friday, January 16, 2004, 12:20:13 PM, you wrote:

Marcus Boerger wrote:

Just a note: The thing you saw required SXE objects implementing
interface RecursiveIterator{reset(), hasMore(), key(), current(),
next(), hasChildren(), getChildren()}

Excuse my ignorance: As I wasn't at ApacheCon I'm not sure what Adam was
talking about, is the SXE/RecursiveIterator stuff what makes foreach
possible?

During Frankfurt Conf Sterling and i discussed the general foreach things
and i made them work. During ApacheCon i started with the RecursiveIterator
and finally made the whole foreach thing work. After the show i also made
the RecursiveIterator work. That's a typical form of php development.
Sometimes we meet live or on irc, discuss ideas and implement them. The
current situation is very much different as you should know by now.

That's exactly why dimity tried to implement $obj->node[0]. And i guess this
is vital to the extension.

Again: Is $obj->node[0] needed to provide foreach in all cases (0, 1 or
more elements)? Or is this another way of accessing the elements?

The term '->xyz' (or in other words property access) could address different
things. If there is only one node of that name then it returns the
object for that node. If there are more nodes of the same name then it
returns an array of the node objects. Hence '->xyz[<n>]' always
returns a single node object, and therefore '->xyz[0]' always returns the node
object for the first node no matter how many nodes of the same name there
are.
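
A brief sketch of the indexed form, assuming a current PHP release; the documents are hypothetical. It shows '->node[0]' selecting the first node whether the name occurs once or several times.

```php
<?php
// Hypothetical documents: one with a single <node>, one with two.
$one  = simplexml_load_string('<doc><node>only</node></doc>');
$many = simplexml_load_string('<doc><node>first</node><node>second</node></doc>');

// Index 0 is the first node in both cases.
$firstOfOne  = (string) $one->node[0];
$firstOfMany = (string) $many->node[0];

// Higher indexes reach the later nodes of the same name.
$second = (string) $many->node[1];

// count() reports how many nodes share the name.
$countOne  = count($one->node);
$countMany = count($many->node);

print "$firstOfOne / $firstOfMany / $second\n"; // only / first / second
```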

To achieve that the property read handler would have to return an array or
iterator class always. That's all that needs to be changed. Unfortunately
this would be a major change of functionality so feel free to drop property
handling completely.

Again I can't follow the conclusion you're drawing. I think people on the
list here expressed what they'd like to see as functionality (i.e. how
they'd use SimpleXML). If it needs a major change so be it.

That's a problem though. If people don't understand, i am fine with them
telling us what they would like to see. But believe me, Sterling, Rob and i
have spent lots of time implementing this. Sometimes such sentences are
only said to denote that there are certain problems. And sometimes i think
it is ok if only a few people understand the full thing. (bla bla)

I guess it can be done by simply fixing the refcounting. So why not start
some work/analysis instead of mucking around? The least you could do is
write some test files under the test subdirectory.

I think the problem right now is that no one feels responsible to go and
start coding. Old dilemma: If you start coding too early people get
upset because you didn't ask first. And if you want to nail down the
specs first you get accused of 'only mucking around'.

So who's responsible for SimpleXML now and what are the specs? I'd be
willing to provide some code if the person in charge wants help, I don't
have the knowledge or karma on this list to lead the development though,
that's for sure. So give me specs and I give you e.g. test cases :-)

Fine. Responsible are Sterling, me and Rob. Also Andi is the RM, so as long
as we haven't released php 5.0 those three persons are the ones who should
make the main decisions.

Things that need to be addressed, that can be fixed inside the extension and
that won't affect other behavior, are handling of attribute values by
proxies. Currently you cannot do '$obj["attribute"] += $value;'. This is
because the dimension read handler returns the value. This must be changed
to a proxy class so that changing that value would result in an update of
the original object. If you really want to do any coding then that's the
place to start.

--
Best regards,
Marcus mailto:helly@php.net

21 years ago by Christian Schneider

Marcus Boerger wrote:

Thanks for moving backward. Since iterating is now worthless i am all for
removing it completely. I mean it isn't even in the spirit of the extension.

Care to explain? I'm dazed and confused on how you came to the
conclusion that iterators are now worthless...

Was this discussed off-list? No message on the list seems to indicate
that to me, but maybe I missed it. Or I don't understand what you mean by
'iterators' or 'worthless'.

Wanting to understand your frustration,

Chris