Howdy,
I have a 12 MB XML file that I am trying to get PHP 4.3.1 to parse
using libxml2 with xmldoc(), and it takes ~11 minutes to do this on a
P4 1.8 GHz box.
The same operation in libxml2's Python interface takes 6 seconds.
Here is my code. Any ideas why it takes 11 minutes? 12 MB of valid
XML is large, but not 11 minutes' worth of large.
<?php
//my includes here
set_time_limit( 345600 );
$xml = implode('', file("foo_3.xml"));
debug_message("parsing xml file of len = ".strlen($xml));
$doc = xmldoc( $xml );
//it takes 11 minutes to get here!
debug_message("GOT THE DOC");
exit;
?>
The output:
[waboring@hemna bin]$ time php -f xml.php
PHP Notice: DEBUG:(16:04:00):/home/waboring/devel/freemap/bin/xml.php::(11) parsing xml file of len = 11223576 in /home/waboring/devel/freemap/lib/util.inc on line 784
PHP Notice: DEBUG:(16:13:55):/home/waboring/devel/freemap/bin/xml.php::(16) GOT THE DOC in /home/waboring/devel/freemap/lib/util.inc on line 784
real 9m56.163s
user 8m57.790s
sys 0m4.280s
Here is the Python version, based on the tst.py that comes with the libxml2-python package:
#!/usr/bin/python -u
import sys
import libxml2

# Memory debug specific
libxml2.debugMemory(1)

doc = libxml2.parseFile("tst.xml")
if doc.name != "tst.xml":
    print "doc.name failed"
    sys.exit(1)
root = doc.children
print "doc.name = "+root.name
#if root.name != "doc":
#    print "root.name failed"
#    sys.exit(1)
child = root.children
#if child.name != "foo":
#    print "child.name failed"
#    sys.exit(1)
doc.freeDoc()

# Memory debug specific
libxml2.cleanupParser()
if libxml2.debugMemory(1) == 0:
    print "OK"
else:
    print "Memory leak %d bytes" % (libxml2.debugMemory(1))
    libxml2.dumpMemory()
The output:
[waboring@hemna bin]$ time ./tst.py
doc.name = MAP
OK
real 0m3.413s
user 0m2.860s
sys 0m0.390s
OK, so I'm not 100% sure that Python's libxml2.parseFile() does
EXACTLY the same thing as PHP's xmldoc(), but I don't think xmldoc()
should take 11 minutes. It also sucks up a ton of RAM.
Walt
At 01:37 02.05.2003, Walt Boring wrote:
BTW tst.xml and foo_3.xml are the same file.
You may like to hear that we are doing much work on xml support for
upcoming php5.
marcus
BTW tst.xml and foo_3.xml are the same file.
Walt
Given the size, wouldn't it be better to use domxml_open_file() and avoid
pulling all of that data into PHP memory first?
From what I've seen personally and heard from others, that should
considerably streamline the flow of data:
file -> XML parser
Instead of:
file -> lots of zvals for the strings in PHP, taking up much more than
the original 12 MB of data -> imploding all those strings into a single
variable -> passing the data by copying it into a variable (no reference
shown in the example) -> XML parser
Then you don't have to bring a 12 megabyte data file into PHP userland by
creating an array element for each line and then imploding those lines.
I'd be interested to hear how much of a difference using domxml_open_file()
would make.
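For comparison with the xmldoc(implode(...)) version, here is a minimal sketch of the direct-from-file path. It assumes the PHP 4 domxml extension is loaded. The helper name load_xml_file() is hypothetical, and the fallback branch using PHP 5's DOMDocument::load() is an assumption added so the sketch also runs where domxml is not available; neither comes from the original script.

```php
<?php
// Hypothetical helper: parse an XML file straight from disk so the
// document text never has to be held in a PHP string first.
function load_xml_file($path)
{
    if (function_exists('domxml_open_file')) {
        // PHP 4 domxml: libxml2 opens and reads the file itself.
        return domxml_open_file($path);
    }
    // Assumed fallback for PHP 5 and later, where domxml no longer exists.
    $doc = new DOMDocument();
    return $doc->load($path) ? $doc : false;
}
```

Calling load_xml_file("foo_3.xml") in place of the file()/implode()/xmldoc() sequence would show how much of the 11 minutes is spent before the parser is even reached.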
Hope it helps!
-- mjh
Mark J. Hershenson wrote:
Given the size, wouldn't it be better to use domxml_open_file() and avoid
pulling all of that data into PHP memory first?
Not so long ago I posted to comp.lang.php to raise the question of
thread safety when accessing files with domxml_open_file() (and other
extension-specific file-accessing functions). No one replied, but I
wanted to know whether it was safer to suck the data into memory while
holding flock() on the file, so that there was no (less?) danger of
dirty reads/writes if the file was being accessed from concurrently
running scripts.
Of course, using flock() means that you actually have to load the file
into memory first, which is obviously bad given your evidence (even when
not using file() and putting it in a very large hash first).
I'm wondering if it is necessary to use flock() on ALL filesystem
interactions in a web-based (multi-threaded) application.
Any insight would be appreciated.
PS. This discussion applies (even more so) to DomDocument->dump_file().
Yes, that's necessary, and not only in multithreaded applications but
also in multi-process applications like the Apache webserver. But those
questions really belong on the php-general@lists.php.net mailing list.
Derick
--
"my other box is your windows PC"
Derick Rethans http://derickrethans.nl/
PHP Magazine - PHP Magazine for Professionals http://php-mag.net/
I'm wondering if it is necessary to use flock() on ALL filesystem
interactions in a web-based (multi-threaded) application.
flock and fcntl predate the multi-threaded approach on Unix
and as such they have process granularity, i.e. they won't
help in multi-threaded environments.
- Sascha
Sascha Schumann wrote:
flock and fcntl predate the multi-threaded approach on Unix
and as such they have process granularity, i.e. they won't
help in multi-threaded environments.
OK. Does this mean that there is no way to avoid dirty reads/writes
without telling the web server to limit everything to one process?
I've already started developing an object wrapper around DOM XML for my
own use (but freely available to anyone, as with all my stuff), which
provides a central point where I can control the way all this file
access is done.
Here is the constructor method alone, plus the constant definitions
preceding the class. This code has not been tested yet.
/**
 * Constructor method
 *
 * Create the objDoc instance property and associated ndRoot property based
 * on the user-selected mode of document creation. In the process, establish
 * the blnIsReadOnly property.
 *
 * @param mixed information required to create a DOM document
 * @param int   constant specifying how the document is to be created
 * @return void
 */
function domDoc(&$Starter, $intUse = DOMDOC_AS_NEW) {
    $this->objErr = new xmlErrorLog($this);
    $this->intMode = $intUse;
    $this->blnIsReadOnly = true;
    // for more info on each case block, see
    // comments in the constant definitions
    // at the top of this file.
    switch ($intUse) {
        case DOMDOC_AS_NEW:
            $this->blnIsReadOnly = false;
            $this->objDoc = domxml_new_doc("1.0");
            $elRoot = $this->objDoc->create_element($Starter);
            $this->ndRoot = $this->objDoc->append_child($elRoot);
            break;
        case DOMDOC_AS_READFILE:
            $fp = fopen($Starter, "r")
                OR $this->objErr->throw(
                    "could not open ".$Starter." for reading."
                );
            // attempt thread safety by using flock
            flock($fp, LOCK_SH);
            $xmlData = fread($fp, filesize($Starter));
            flock($fp, LOCK_UN);
            fclose($fp);
            // finish working with the file as
            // quickly as possible and THEN worry
            // about making a DOM doc from it.
            $this->_domOpenFromData($xmlData);
            break;
        case DOMDOC_AS_WRITEFILE:
            $this->blnIsReadOnly = false;
            // "r+b" opens for reading and writing without truncating;
            // a "w+" mode would empty the file before it could be read.
            $this->fp = fopen($Starter, "r+b")
                OR $this->objErr->throw(
                    "could not open ".$Starter." for writing."
                );
            // the lock is held until the document is committed
            flock($this->fp, LOCK_EX);
            $xmlData = fread($this->fp, filesize($Starter));
            $this->_domOpenFromData($xmlData);
            break;
        case DOMDOC_AS_REMOTEFILE:
            $fp = fopen($Starter, "r")
                OR $this->objErr->throw(
                    "could not open ".$Starter." for reading."
                );
            // filesize() cannot stat a remote URL, so read until EOF
            $xmlData = "";
            while (!feof($fp)) {
                $xmlData .= fread($fp, 8192);
            }
            fclose($fp);
            $this->_domOpenFromData($xmlData);
            break;
        case DOMDOC_AS_REFERENCE:
            $this->objDoc =& $Starter;
            break;
        case DOMDOC_AS_DATA:
            $this->_domOpenFromData($Starter);
            break;
        default:
            $this->objErr->throw(
                "second argument to domDoc constructor is invalid"
            );
    }
}
This really doesn't belong on this list, which is for developing the
language PHP, not for developing with PHP.
Derick
Derick Rethans wrote:
This really doesn't belong on this list, which is for developing the
language PHP, not for developing with PHP.
I'm not sure I agree, particularly if you look at the original question,
where Walt asked about the efficiency of some PHP code that he wrote.
All I've done is extend the thread to include discussion of the thread
safety of at least one function call, a function call which was
developed by an INTERNAL developer. My original question amounts to "is
domxml_open_file() threadsafe". IMO this is not a GENERAL list question.
Anyway, it's counterproductive to generate traffic over semantics >:/
There are two other methods I should have included (which explains why I
haven't closed the resource in DOMDOC_AS_WRITEFILE mode).
I realise this question is borderline internals/general, but no one in
the user community seems to be equipped to answer it for me. I
apologise if I have made an error in this judgement.
/**
 * mass storage serialisation
 *
 * This function will dump the textual contents of the DOM document
 * (in its current state) to the file opened by the constructor.
 *
 * @return bool success or failure
 */
function Commit() {
    // the constructor stores the mode in intMode
    if ($this->intMode == DOMDOC_AS_WRITEFILE) {
        // the handle sits after the data read by the constructor,
        // so go back to the start and discard the old contents
        rewind($this->fp);
        ftruncate($this->fp, 0);
        fwrite($this->fp, $this->xmlGetDoc());
        flock($this->fp, LOCK_UN);
        fclose($this->fp);
    }
    else {
        return $this->objErr->throw(
            "Commit() can only be used when domDoc is invoked in "
            ."DOMDOC_AS_WRITEFILE mode"
        );
    }
    return true;
}
/**
 * mass storage serialisation
 *
 * This function will dump the textual contents of the DOM document
 * (in its current state) to a specified file.
 *
 * @param uri path to destination file
 * @return bool success or failure
 */
function CommitToFile($uriDestination) {
    // note: "w+" truncates the file before the lock is acquired
    $fp = fopen($uriDestination, "w+")
        or $this->objErr->throw(
            "CommitToFile: could not open ".$uriDestination." for writing"
        );
    flock($fp, LOCK_EX);
    fwrite($fp, $this->xmlGetDoc())
        or $this->objErr->throw(
            "CommitToFile: could not write to ".$uriDestination
        );
    flock($fp, LOCK_UN);
    fclose($fp);
    return true;
}
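As a rough sketch of the same locking idea outside the class, the two helpers below read and replace a file under flock(). The names locked_read() and locked_write() are hypothetical and not part of the wrapper above. The writer opens in append mode so that nothing is truncated until the exclusive lock is held. These locks are advisory, so they only help against other code that also calls flock(), and, as noted earlier in the thread, they may not help between threads of one process.

```php
<?php
// Hypothetical helpers; names and structure are illustrative only.

// Read a whole file while holding a shared (reader) lock.
function locked_read($path)
{
    $fp = fopen($path, 'rb');
    if (!$fp) {
        return false;
    }
    if (!flock($fp, LOCK_SH)) {
        fclose($fp);
        return false;
    }
    // Read until EOF instead of trusting a possibly stale filesize().
    $data = '';
    while (!feof($fp)) {
        $chunk = fread($fp, 8192);
        if ($chunk === false) {
            break;
        }
        $data .= $chunk;
    }
    flock($fp, LOCK_UN);
    fclose($fp);
    return $data;
}

// Replace a file's contents while holding an exclusive (writer) lock.
function locked_write($path, $data)
{
    // 'ab' creates the file if needed but, unlike 'w', does not
    // truncate it before the lock has been acquired.
    $fp = fopen($path, 'ab');
    if (!$fp) {
        return false;
    }
    if (!flock($fp, LOCK_EX)) {
        fclose($fp);
        return false;
    }
    ftruncate($fp, 0);                     // discard old contents under the lock
    $ok = (fwrite($fp, $data) !== false);  // append mode writes at the new end
    fflush($fp);                           // flush before releasing the lock
    flock($fp, LOCK_UN);
    fclose($fp);
    return $ok;
}
```

With helpers like these, the read path could pass locked_read($Starter) to _domOpenFromData(), at the cost of holding the whole document in a PHP string.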
Terrence,
Please keep this stuff off the internals@ list, and on the general list
where it belongs.
The internals list is for people developing the core of PHP using C, not
for people writing wrapper scripts in PHP.
You might have a better response on the various xml mailing lists.
--Wez.