Attached is a patch with my initial cut at unicode support for XML
(made against the /ext directory).
I started with XMLReader since it was the smallest.
The code can probably be optimized a bit, but I want to make sure this
is the right approach, because the changes made here are the same
changes needed for the rest of the XML based extensions (simplexml,
xsl, xmlwriter, and xml to a point).
It includes the following:
Macros defined in php_libxml.h (names can be changed if anyone has a
problem with them):
    ZVAL_XML_STRING(z, s, flags)
    RETVAL_XML_STRING(s, flags)
These take the UTF-8 output from libxml2 functions and return the
correct string type (UTF-16 when running in unicode mode, UTF-8 when
not).
XMLReader:
In order to maintain BC with PHP 5 it accepts both unicode and binary
strings (UTF-8, as in PHP 5) as parameters. The parameters can be mixed
(some unicode, some binary); all strings are converted to UTF-8 as
needed to work with libxml2.
In order to only require 1 hash table for properties, the
following is used in MINIT:
    zend_u_hash_init(&xmlreader_prop_handlers, 0, NULL, NULL, 1,
        (zend_bool)zend_ini_long("unicode.semantics",
        sizeof("unicode.semantics"), 1));
Tests have been updated for unicode mode.
Let me know if anyone sees any problems with these changes.
Rob
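The behaviour described for ZVAL_XML_STRING / RETVAL_XML_STRING can be
sketched roughly as follows. This is only an illustration of the idea
(take UTF-8 from libxml2, hand back UTF-16 in unicode mode and UTF-8
otherwise), not the code from the patch. The sketch_string type and the
sketch_* functions are hypothetical stand-ins for the zval and the
macros, and the unicode_mode argument stands in for the
unicode.semantics setting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a returned string value; NOT the real zval. */
typedef struct {
    int is_unicode;   /* 1: u16 holds UTF-16 code units, 0: u8 holds UTF-8 */
    size_t len;       /* code units (UTF-16) or bytes (UTF-8) */
    uint16_t *u16;
    char *u8;
} sketch_string;

/* Decode UTF-8 into UTF-16 with minimal validation (overlong forms are
 * not rejected). `out` must have room for n code units. Returns the
 * number of code units written, or (size_t)-1 on malformed input. */
static size_t sketch_utf8_to_utf16(const unsigned char *s, size_t n,
                                   uint16_t *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t cp;
        size_t extra;
        unsigned char b = s[i++];
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;
        while (extra--) {
            if (i >= n || (s[i] & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(s[i++] & 0x3F);
        }
        if (cp > 0x10FFFF)
            return (size_t)-1;
        if (cp >= 0x10000) {          /* encode as a surrogate pair */
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[o++] = (uint16_t)cp;
        }
    }
    return o;
}

/* Fill *z from the NUL-terminated UTF-8 string s. Returns 0 on success,
 * -1 on allocation failure or malformed UTF-8. */
static int sketch_xml_string(sketch_string *z, const char *s,
                             int unicode_mode)
{
    size_t n;
    assert(z != NULL && s != NULL);
    n = strlen(s);
    memset(z, 0, sizeof *z);
    if (!unicode_mode) {              /* non-unicode mode: copy the bytes */
        z->u8 = malloc(n + 1);
        if (!z->u8)
            return -1;
        memcpy(z->u8, s, n + 1);
        z->len = n;
        return 0;
    }
    /* n UTF-8 bytes never decode to more than n UTF-16 code units */
    z->u16 = malloc((n + 1) * sizeof *z->u16);
    if (!z->u16)
        return -1;
    z->len = sketch_utf8_to_utf16((const unsigned char *)s, n, z->u16);
    if (z->len == (size_t)-1) {
        free(z->u16);
        z->u16 = NULL;
        z->len = 0;
        return -1;
    }
    z->is_unicode = 1;
    return 0;
}
```

The point of routing every libxml2 result through one helper is that
the calling code does not need to check the unicode setting itself.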
Had some feedback about a problem with the attached file, so here is
also a link to the diff:
http://www.ctindustries.net/patches/xmlunicode.diff.txt
Rob
Rob,
I have not tested the patch, but it looks good to me on cursory
overview. I assume it passes your tests?
The only comment I have is regarding the usage of 't' and 'T'
specifiers. Since you always have to pass binary UTF-8 strings to
libxml, we should always use 's' specifier and let PHP downconvert
Unicode strings based on the runtime encoding (which you set to UTF-8).
-Andrei
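The effect of the suggestion above, as far as the extension code is
concerned, is that every string parameter arrives as UTF-8 bytes no
matter whether the caller passed a unicode or a binary string. A rough
sketch of that downconversion is below. It does not use the real
parameter parsing API; sketch_param and the sketch_* functions are
hypothetical names, and the runtime encoding is assumed to be UTF-8 as
stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an incoming string argument. */
typedef struct {
    int is_unicode;          /* 1: use u16/u16_len, 0: use bin/bin_len */
    const uint16_t *u16;
    size_t u16_len;
    const char *bin;         /* binary string, assumed to be UTF-8 */
    size_t bin_len;
} sketch_param;

/* Encode UTF-16 as UTF-8. `out` must have room for 3 * n bytes.
 * Returns bytes written, or (size_t)-1 on an unpaired surrogate. */
static size_t sketch_utf16_to_utf8(const uint16_t *in, size_t n,
                                   unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t cp = in[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF) {       /* high surrogate */
            if (i >= n || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10)
                 + (uint32_t)(in[i++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) { /* stray low surrogate */
            return (size_t)-1;
        }
        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}

/* Return a malloc'd, NUL-terminated UTF-8 copy of the parameter, so the
 * caller can pass it straight to a libxml2-style function. Returns NULL
 * on allocation failure or invalid UTF-16. The caller frees the result. */
static char *sketch_param_as_utf8(const sketch_param *p, size_t *out_len)
{
    char *buf;
    size_t len;
    assert(p != NULL && out_len != NULL);
    if (!p->is_unicode) {             /* binary string: pass through */
        buf = malloc(p->bin_len + 1);
        if (!buf)
            return NULL;
        memcpy(buf, p->bin, p->bin_len);
        len = p->bin_len;
    } else {                          /* unicode string: downconvert */
        buf = malloc(p->u16_len * 3 + 1);
        if (!buf)
            return NULL;
        len = sketch_utf16_to_utf8(p->u16, p->u16_len,
                                   (unsigned char *)buf);
        if (len == (size_t)-1) {
            free(buf);
            return NULL;
        }
    }
    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```

With the conversion done in one place before the extension sees the
argument, mixed unicode and binary parameters need no special handling
in each method.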
Updated the code with your suggestion. I first attempted to eliminate
having to change converters when running with unicode off for all the
"t" parameters (to save a few extra instructions there), but the code
is much more manageable now than when converting them manually.
Would like some feedback, though, on the changes made to xmlreader
before moving on to any of the other extensions (seeing as the changes
are going to be pretty much the same).
Rob
Hey Rob,
Looks good. Have you tested the filesystem (filename) related functions
with non-ASCII filenames? Try making a file called "informaçon.xml" for
example, set unicode.filesystem_encoding=utf-8 (or whatever encoding
your filesystem uses) and see if you can read it.
-Andrei