unicode and xml extensions

19 years ago by Rob Richards

Attached is a patch for my initial cut for unicode and XML (made against
the /ext directory).
I started with XMLReader since it was the smallest.
The code can probably be optimized a bit, but I want to make sure this
is how it should be because the changes made here will be the changes
needed for the rest of the XML based extensions (simplexml, xsl,
xmlwriter, and xml to a point).

It includes the following:
Macros defined in php_libxml.h (names can be changed if anyone has a
problem with them):
   ZVAL_XML_STRING(z, s, flags)
   RETVAL_XML_STRING(s, flags)
These are used to take the UTF-8 output from libxml2 functions
and return the correct string (UTF-16 when running in unicode mode, or
UTF-8 when not).
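
For illustration only, and not taken from the patch: a minimal standalone sketch of the idea behind the two macros, assuming the result is UTF-16 in unicode mode and a plain UTF-8 copy otherwise. The names sketch_value, sketch_utf8_to_utf16 and sketch_retval_xml_string are hypothetical, the fixed-size buffers stand in for real allocation, and the macros' flags argument is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CAP 64

/* Hypothetical stand-in for a return value: either UTF-16 units or UTF-8 bytes. */
typedef struct {
    int is_unicode;             /* 1: u16 is valid, 0: u8 is valid */
    size_t len;                 /* length in UTF-16 units or in bytes */
    uint16_t u16[SKETCH_CAP];
    unsigned char u8[SKETCH_CAP];
} sketch_value;

/* Decode UTF-8 into UTF-16 code units. Returns the number of units written,
 * or (size_t)-1 on malformed input or lack of space. Overlong forms are not
 * rejected in this sketch. */
size_t sketch_utf8_to_utf16(const unsigned char *in, size_t in_len,
                            uint16_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;
    assert(in != NULL && out != NULL);
    while (i < in_len) {
        uint32_t cp;
        size_t extra;
        unsigned char b = in[i++];
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;                     /* invalid lead byte */
        if (extra > in_len - i) return (size_t)-1;  /* truncated sequence */
        while (extra--) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return (size_t)-1;
        if (cp < 0x10000) {
            if (n + 1 > out_cap) return (size_t)-1;
            out[n++] = (uint16_t)cp;
        } else {                                    /* encode a surrogate pair */
            if (n + 2 > out_cap) return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return n;
}

/* Store libxml2-style UTF-8 output in the form the caller's mode expects.
 * Returns 0 on success, -1 on failure. */
int sketch_retval_xml_string(sketch_value *ret, const unsigned char *s,
                             size_t len, int unicode_mode)
{
    if (unicode_mode) {
        size_t n = sketch_utf8_to_utf16(s, len, ret->u16, SKETCH_CAP);
        if (n == (size_t)-1) return -1;
        ret->is_unicode = 1;
        ret->len = n;
    } else {
        if (len > SKETCH_CAP) return -1;
        memcpy(ret->u8, s, len);
        ret->is_unicode = 0;
        ret->len = len;
    }
    return 0;
}
```

The point of funnelling every libxml2 result through one helper is that the mode check lives in a single place, which is presumably why the patch uses macros for it.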

XMLReader:
   In order to maintain BC with PHP 5, it accepts unicode and binary
   strings (UTF-8 as in PHP 5) as parameters. The parameters can be mixed
   (some unicode and some binary), so strings are properly converted to
   UTF-8 to work with libxml2.

   In order to only require 1 hash table for properties, the
   following is used in MINIT:
      zend_u_hash_init(&xmlreader_prop_handlers, 0, NULL, NULL, 1,
          (zend_bool)zend_ini_long("unicode.semantics",
          sizeof("unicode.semantics"), 1));

   Tests have been updated for unicode mode.

Let me know if anyone sees any problems with these changes.

Rob

19 years ago by Rob Richards

Had some feedback about a problem with the attached file, so here's also
a link to the diff.

http://www.ctindustries.net/patches/xmlunicode.diff.txt

Rob

19 years ago by Andrei Zmievski

Rob,

I have not tested the patch, but it looks good to me on cursory
overview. I assume it passes your tests?
The only comment I have is regarding the usage of 't' and 'T'
specifiers. Since you always have to pass binary UTF-8 strings to
libxml, we should always use 's' specifier and let PHP downconvert
Unicode strings based on the runtime encoding (which you set to UTF-8).
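
For illustration only: libxml2 consumes UTF-8, so a Unicode (UTF-16) argument has to be down-converted before it is handed over. The standalone sketch below shows that conversion step in isolation. The function name is hypothetical and this is not PHP's converter code, which goes through the runtime-encoding converter mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode UTF-16 code units as UTF-8. Returns the number of bytes written,
 * or (size_t)-1 on an unpaired surrogate or lack of space. */
size_t sketch_utf16_to_utf8(const uint16_t *in, size_t in_len,
                            unsigned char *out, size_t out_cap)
{
    size_t i = 0, n = 0;
    assert(in != NULL && out != NULL);
    while (i < in_len) {
        uint32_t cp = in[i++];
        size_t need;
        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            if (i >= in_len || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                       /* unpaired low surrogate */
        }
        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (need > out_cap - n) return (size_t)-1;
        if (need == 1) {
            out[n++] = (unsigned char)cp;
        } else if (need == 2) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

Doing this once at parameter-parsing time means the extension code only ever sees UTF-8 bytes, whichever kind of string the script passed.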

-Andrei

19 years ago by Rob Richards

Andrei Zmievski wrote:

> Rob,
>
> I have not tested the patch, but it looks good to me on cursory
> overview. I assume it passes your tests?
> The only comment I have is regarding the usage of 't' and 'T'
> specifiers. Since you always have to pass binary UTF-8 strings to
> libxml, we should always use 's' specifier and let PHP downconvert
> Unicode strings based on the runtime encoding (which you set to UTF-8).

Updated the code with your suggestion. I first attempted to eliminate
having to change converters when running with unicode off for all the
"t" parameters (save a few extra instructions there), but the code is
much more manageable now than when converting them manually.

Would like some feedback, though, on the changes made to xmlreader
before moving on to any of the other extensions (seeing the changes are
going to be pretty much the same).

Rob

19 years ago by Andrei Zmievski

Hey Rob,

Looks good. Have you tested the filesystem (filename) related functions
with non-ASCII filenames? Try making a file called "informaçon.xml" for
example, set unicode.filesystem_encoding=utf-8 (or whatever encoding
your filesystem uses) and see if you can read it.

-Andrei

19 years ago by Andrei Zmievski

Great! I'll put a slide about this into my talk for OSCON.

What're your plans for the rest of the XML extensions?

-Andrei

> Andrei Zmievski wrote:
>
>> Hey Rob,
>>
>> Looks good. Have you tested the filesystem (filename) related
>> functions with non-ASCII filenames? Try making a file called
>> "informaçon.xml" for example, set
>> unicode.filesystem_encoding=utf-8 (or whatever encoding your
>> filesystem uses) and see if you can read it.
>
> Ok, used the php_stream_path_encode functionality I saw in
> file_get_contents and am now able to read files with non-ASCII
> chars while all tests still pass.
> I believe all the conversions are now done for this extension. The
> only functionality without any special handling occurs when passing
> source XML as a string - XML() and setRelaxNGSchemaSource().
>
> Rob

19 years ago by Rob Richards

Almost done with DOM (3 more files to go), so hopefully by Monday. This
one will need a lot of testing though.

Rob

19 years ago by Andrei Zmievski

Awesome.

I am planning to add "s(encoding)" support to parameter parsing, by
the way, so getting strings in UTF-8 encoding will be a bit easier.
Would probably need to change the relevant portions of your commits.
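
For illustration only: one way an "s(encoding)" item could be read out of a format string. The syntax comes from the message above; the function name, the buffer handling and the rule that a bare 's' yields an empty encoding name are assumptions of this sketch, not the eventual PHP implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parse one 's' or 's(encoding)' item starting at *spec.
 * On success, copies the encoding name into enc (empty for a bare 's'),
 * advances *spec past the item and returns 0. Returns -1 on a syntax
 * error or if the name does not fit into enc. */
int sketch_parse_s_spec(const char **spec, char *enc, size_t enc_cap)
{
    const char *p;
    assert(spec != NULL && *spec != NULL && enc != NULL);
    p = *spec;
    if (enc_cap == 0 || *p != 's') return -1;
    p++;
    enc[0] = '\0';
    if (*p == '(') {
        /* The name has to be scanned character by character up to ')'. */
        const char *close = strchr(p + 1, ')');
        size_t len;
        if (close == NULL) return -1;
        len = (size_t)(close - (p + 1));
        if (len == 0 || len >= enc_cap) return -1;
        memcpy(enc, p + 1, len);
        enc[len] = '\0';
        p = close + 1;
    }
    *spec = p;
    return 0;
}
```

After the name is extracted, a real implementation would still have to look up a converter for it on every call, which is extra work compared with a single-character specifier.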

-Andrei

19 years ago by Rob Richards

Andrei Zmievski wrote:

> Awesome.
>
> I am planning to add "s(encoding)" support to parameter parsing, by
> the way, so getting strings in UTF-8 encoding will be a bit easier.
> Would probably need to change the relevant portions of your commits.

Any idea when this should be ready, or should I just go ahead and commit
the DOM unicode changes now?

Rob

19 years ago by Andrei Zmievski

I probably won't get to it this weekend. Might have it done during
OSCON next week, so it's up to you.

-Andrei

19 years ago by Marcus Boerger

Hello Andrei,

Don't we have a char left for UTF-8 (maybe '8')? It would be a case that
we will have to use very often, and checking for a string in braces will
take some time.

best regards
marcus

Friday, July 21, 2006, 9:39:32 PM, you wrote:

> Awesome.
>
> I am planning to add "s(encoding)" support to parameter parsing, by
> the way, so getting strings in UTF-8 encoding will be a bit easier.
> Would probably need to change the relevant portions of your commits.
>
> -Andrei

19 years ago by Andrei Zmievski

Maybe. An alternate way would be to add a modifier to 's' that makes it
accept a converter to use for conversion.

   if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s>", &str,
           &str_len, UG(utf8_conv)) == FAILURE) {
       return;
   }

This does mean that the caller will have to instantiate the converter
themselves, but might be more efficient if there are multiple 's'
parameters needing to use the same converter.
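
For illustration only: a standalone sketch of how a '>' modifier might pull a caller-supplied converter out of the variadic arguments. Everything here is hypothetical (sketch_parse_params, the sketch_converter type, the fixed output buffers), and the demo converter merely upper-cases ASCII as a stand-in for a real charset conversion. It is not the zend_parse_parameters implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF 64

/* Converter callback: writes a NUL-terminated result into out and returns
 * its length, or (size_t)-1 on failure. */
typedef size_t (*sketch_converter)(const char *in, size_t in_len,
                                   char *out, size_t out_cap);

/* Stand-in converter for the demo: ASCII upper-casing. */
size_t sketch_upper_conv(const char *in, size_t in_len,
                         char *out, size_t out_cap)
{
    size_t k;
    if (in_len >= out_cap) return (size_t)-1;
    for (k = 0; k < in_len; k++)
        out[k] = (char)toupper((unsigned char)in[k]);
    out[in_len] = '\0';
    return in_len;
}

/* For each 's' in spec, consume (char *out, size_t *out_len) from the
 * variadic list; out must point to SKETCH_BUF bytes. If the 's' is
 * followed by '>', also consume a sketch_converter and run the argument
 * through it. Returns 0 on success, -1 on failure. */
int sketch_parse_params(const char *spec, const char *const *args,
                        size_t nargs, ...)
{
    va_list ap;
    size_t i = 0;
    int rc = 0;
    assert(spec != NULL && args != NULL);
    va_start(ap, nargs);
    for (; *spec != '\0' && rc == 0; spec++, i++) {
        char *out;
        size_t *out_len, in_len;
        if (*spec != 's' || i >= nargs) { rc = -1; break; }
        out = va_arg(ap, char *);
        out_len = va_arg(ap, size_t *);
        in_len = strlen(args[i]);
        if (spec[1] == '>') {
            /* The converter is supplied by the caller, not looked up by name. */
            sketch_converter conv = va_arg(ap, sketch_converter);
            spec++;
            *out_len = conv(args[i], in_len, out, SKETCH_BUF);
            if (*out_len == (size_t)-1) rc = -1;
        } else if (in_len < SKETCH_BUF) {
            memcpy(out, args[i], in_len + 1);
            *out_len = in_len;
        } else {
            rc = -1;
        }
    }
    va_end(ap);
    return rc;
}
```

A call such as sketch_parse_params("s>s>", args, 2, a, &a_len, conv, b, &b_len, conv) passes the same converter twice, so it only has to be instantiated once by the caller.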

-Andrei

19 years ago by Rob Richards

IMO, this would probably be the easiest and best way to handle the conversions.

Rob
