PHP Unicode extension in PHP6

18 years ago by Tomas Kuliavas

Hi,

Could you make unicode.semantics configurable at PHP_INI_ALL level? Or
maybe PHP6 has string functions that are not unicode aware?

--
Tomas

18 years ago by Antony Dovgal

Hi,

Could you make unicode.semantics configurable at PHP_INI_ALL level?

No.

Or maybe PHP6 has string functions that are not unicode aware?

All string functions are supposed to be able to work with both Unicode and binary strings.
Unicode is just an addition, it doesn't mean that binary strings are not supported anymore, even in Unicode mode.
But I don't really understand the reason for your question, care to provide more details?

--
Wbr,
Antony Dovgal

18 years ago by Tomas Kuliavas

Hi,

Could you make unicode.semantics configurable at PHP_INI_ALL level?

No.

Or maybe PHP6 has string functions that are not unicode aware?

All string functions are supposed to be able to work with both Unicode and
binary strings.
Unicode is just an addition, it doesn't mean that binary strings are not
supported anymore, even in Unicode mode.
But I don't really understand the reason for your question, care to
provide more details?

SquirrelMail scripts are designed to work with binary strings. They will
have to deal with emails written in many different character sets. In some
cases scripts must know string length in bytes and not in symbols. If PHP
starts converting email body or message parts, strings won't match
information stored in email headers.

If unicode.semantics is turned on, PHP6-dev breaks one-time pad creation
and seeding of mt_rand. crc32, base64_encode and fputs emit notices and
warnings: "function expects parameter 1 to be strictly a binary string,
Unicode string given" and "x character unicode buffer downcoded for binary
stream runtime_encoding". I might provide sample code if I find a way
to reduce the existing code to something simple. Currently I am trying to
understand what exactly is broken in SquirrelMail functions.

I can fix these issues, but one day PHP might add similar checks to
str_replace(), array functions and pcre extension. Then it will break
character set conversion functions and any other code that operates with
8bit strings. Currently I am stuck on broken authentication and can't
check if other parts of interface are already broken.

I could not find the way to disable unicode.semantics in the script.
PHP_INI_PERDIR is not an option for scripts that are designed to be
portable. In some cases end user can't use .htaccess and can't control
php.ini or httpd.conf. mbstring function overloading effects can be
disabled. The way to turn off unicode.semantics is not documented. If
mbstring.func_overload is turned on, I can't trust string functions. Same
thing happens when unicode.semantics are turned on.

--
Tomas

18 years ago by Antony Dovgal

SquirrelMail scripts are designed to work with binary strings. They will
have to deal with emails written in many different character sets. In some
cases scripts must know string length in bytes and not in symbols. If PHP
starts converting email body or message parts, strings won't match
information stored in email headers.

Try this, you'll see it's really easy:
<?php
//in Unicode mode all strings created this way are Unicode strings
$s = "<any unicode string>";
var_dump(strlen(($s)));
var_dump(strlen((binary)$s));
?>
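
The difference between the two lengths can also be sketched outside PHP 6. The snippet below is only an analogy, assuming Python 3 as a stand-in: `str` plays the role of a Unicode string and `bytes` the role of a binary string. It does not reproduce unicode.semantics itself, and the sample text and the UTF-8 encoding are arbitrary choices.

```python
# Python 3 analogy (assumption): str ~ Unicode string, bytes ~ binary string.
text = "\u0105\u017euolas"        # sample text: 7 characters, 2 of them non-ASCII
raw = text.encode("utf-8")        # explicit conversion to a byte string

char_count = len(text)            # counts code points
byte_count = len(raw)             # counts bytes

print(char_count, byte_count)     # 7 9
```

The byte count is the one that has to agree with sizes recorded elsewhere, for example in mail headers; the character count is what a text-oriented string function reports.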

If unicode.semantics is turned on, PHP6-dev breaks one-time pad creation
and seeding of mt_rand. crc32, base64_encode and fputs emit notices and
warnings: "function expects parameter 1 to be strictly a binary string,
Unicode string given" and "x character unicode buffer downcoded for binary
stream runtime_encoding". I might provide sample code if I find a way
to reduce the existing code to something simple. Currently I am trying to
understand what exactly is broken in SquirrelMail functions.

I don't think there is something broken.
You just pass unicode string to the functions which expected only binary ones.

I can fix these issues, but one day PHP might add similar checks to
str_replace(), array functions and pcre extension. Then it will break
character set conversion functions and any other code that operates with
8bit strings. Currently I am stuck on broken authentication and can't
check if other parts of interface are already broken.

I could not find the way to disable unicode.semantics in the script.

Sure, that's not possible.

PHP_INI_PERDIR is not an option for scripts that are designed to be
portable. In some cases end user can't use .htaccess and can't control
php.ini or httpd.conf. mbstring function overloading effects can be
disabled. The way to turn off unicode.semantics is not documented. If
mbstring.func_overload is turned on, I can't trust string functions. Same
thing happens when unicode.semantics are turned on.

--
Wbr,
Antony Dovgal

18 years ago by Tomas Kuliavas

SquirrelMail scripts are designed to work with binary strings. They will
have to deal with emails written in many different character sets. In some
cases scripts must know string length in bytes and not in symbols. If PHP
starts converting email body or message parts, strings won't match
information stored in email headers.

Try this, you'll see it's really easy:
<?php
//in Unicode mode all strings created this way are Unicode strings
$s = "<any unicode string>";
var_dump(strlen(($s)));
var_dump(strlen((binary)$s));
?>

http://www.php.net/language.types.type-juggling#language.types.typecasting

No (binary).
PHP 4.1.2 = parse error in test2.php on line 5.
PHP 5.2.0 = Parse error: syntax error, unexpected T_VARIABLE in test2.php
on line 5
It is E_PARSE error, so I can't apply fix after detecting PHP6.

Fix is not portable. We are talking about SquirrelMail code. Minimal PHP
4.1.x requirement.

Good fix is closer to
<?php
//in Unicode mode all strings created this way are Unicode strings
$s = "<any unicode string>";
var_dump(strlen($s));
// add php6 test here {
settype($s,'binary');
// }
var_dump(strlen($s));
?>

I will have to write wrappers for all affected functions. Not wise.

strlen("\xC4\x85") = 2. strlen((binary)"\xC4\x85") = 4. Not good. It is
one character in utf-8.

Nevermind. Will watch how PHP6-dev changes and wait for better documentation.

--
Tomas

18 years ago by Antony Dovgal

Try this, you'll see it's really easy:
<?php
//in Unicode mode all strings created this way are Unicode strings
$s = "<any unicode string>";
var_dump(strlen(($s)));
var_dump(strlen((binary)$s));
?>

http://www.php.net/language.types.type-juggling#language.types.typecasting

No (binary).
PHP 4.1.2 = parse error in test2.php on line 5.
PHP 5.2.0 = Parse error: syntax error, unexpected T_VARIABLE in test2.php
on line 5
It is E_PARSE error, so I can't apply fix after detecting PHP6.

Yes. We're talking of PHP6 here.

Fix is not portable. We are talking about SquirrelMail code. Minimal PHP
4.1.x requirement.

Good fix is closer to
<?php
//in Unicode mode all strings created this way are Unicode strings
$s = "<any unicode string>";
var_dump(strlen($s));
// add php6 test here {
settype($s,'binary');
// }
var_dump(strlen($s));
?>

In real life this hack would not be required, since you should/would be using streams returning binary data.
My example was just to show how to cast unicode to binary string.

I will have to write wrappers for all affected functions. Not wise.

strlen("\xC4\x85") = 2. strlen((binary)"\xC4\x85") = 4. Not good. It is
one character in utf-8.

I'm afraid I don't understand you again..

Nevermind. Will watch how PHP6-dev changes and wait for better documentation.

Watch? Just watching is pointless.
Contribute to the discussions, exchange ideas, help developers - that'll make your life easier eventually.

--
Wbr,
Antony Dovgal

18 years ago by Tomas Kuliavas

strlen("\xC4\x85") = 2. strlen((binary)"\xC4\x85") = 4. Not good. It is
one character in utf-8.

I'm afraid I don't understand you again..

0xC4 and 0x85 are hex codes for latin small letter a with ogonek in utf-8. ą

<?php
var_dump("ą" == "\xC4\x85");
echo "ą\n";
echo "\xC4\x85";
?>

If script is written in utf-8, I expect bool(true) on var_dump() line. It
is bool(false), when unicode.semantics are turned on. Internal
SquirrelMail character set decoding functions write mapping tables in
hexadecimals or octals. In some cases they evaluate only byte value and
not whole symbol. Multibyte character set decoding can use recode, iconv
and mbstring, but most of single byte decoding is written in plain string
functions and stores hex to html mapping tables in associative arrays.

<?php
// example uses utf-8. similar code is used in iso-8859-2 -
// iso-8859-16 decoding. utf-8 decoding does not need mapping tables
// and is written in pcre.
$s1 = "ą";
$s2 = "\xC4\x85";
echo str_replace($s2,'&#261;',$s1);
?>

Expected result: &#261;
Got: ą

test setup (php6.0-200705190630) uses trimmed php.ini with only
unicode.semantics=on setting

unicode.fallback_encoding - no value
unicode.filesystem_encoding - no value
unicode.http_input_encoding - no value
unicode.output_encoding - no value
unicode.runtime_encoding - no value
unicode.script_encoding - no value
unicode.semantics - On
unicode.stream_encoding - UTF-8

18 years ago by Stefan Walk

Disclaimer: I don't know much about the way unicode is implemented in
php, i have only used it a bit, but i believe i can clear some things
up here.

0xC4 and 0x85 are hex codes for latin small letter a with ogonek in utf-8. ą

<?php
var_dump("ą" == "\xC4\x85");
echo "ą\n";
echo "\xC4\x85";
?>

If script is written in utf-8, I expect bool(true) on var_dump() line.

You expect wrong things. "\xC4\x85" is a unicode string containing two
codepoints, those at 0xC4 and 0x85 (LATIN CAPITAL LETTER A WITH
DIAERESIS and NEXT LINE (NEL)), while "ą" is a unicode string
containing one code point (0x0105, LATIN SMALL LETTER A WITH OGONEK)
(see
http://www.unicode.org/charts/PDF/U0080.pdf and
http://www.unicode.org/charts/PDF/U0100.pdf). Different strings, so
comparison should return false. If you want to type bytes, use the
"b" prefix: b"\xC4\x85", and compare that with the binary version of
your string literal. var_dump(b"ą" == b"\xC4\x85"); should give you
bool(true) if your encoding is utf-8.
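
A minimal sketch of this distinction, assuming Python 3 as a stand-in for PHP 6 Unicode mode (`str` for Unicode strings, `bytes` for binary strings). It illustrates the explanation above and is not a test of PHP itself.

```python
# Python 3 analogy (assumption): str ~ Unicode string, bytes ~ binary string.
escaped = "\xC4\x85"      # text with two code points: U+00C4 and U+0085
letter = "\u0105"         # text with one code point: LATIN SMALL LETTER A WITH OGONEK

same_text = (escaped == letter)                       # compared as text
same_bytes = (b"\xC4\x85" == letter.encode("utf-8"))  # compared as bytes

print([hex(ord(c)) for c in escaped])   # ['0xc4', '0x85']
print(same_text, same_bytes)            # False True
```

The escapes only mean "bytes" once the literal itself is a byte string; in a text literal they name code points.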

It
is bool(false), when unicode.semantics are turned on. Internal
SquirrelMail character set decoding functions write mapping tables in
hexadecimals or octals. In some cases they evaluate only byte value and
not whole symbol. Multibyte character set decoding can use recode, iconv
and mbstring, but most of single byte decoding is written in plain string
functions and stores hex to html mapping tables in associative arrays.

<?php
// example uses utf-8. similar code is used in iso-8859-2 -
// iso-8859-16 decoding. utf-8 decoding does not need mapping tables
// and is written in pcre.
$s1 = "ą";
$s2 = "\xC4\x85";
echo str_replace($s2,'&#261;',$s1);
?>

Expected result: &#261;
Got: ą

Same thing. If you want binary replacements, use binary strings, not
unicode strings.
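
The same point for replacements, again sketched in Python 3 as an assumed stand-in. The entity `&#261;` (decimal for U+0105) is used here only as an example replacement value.

```python
# Python 3 analogy (assumption): str ~ Unicode string, bytes ~ binary string.
letter = "\u0105"                    # one code point
message = letter.encode("utf-8")     # the two bytes C4 85

# Text search: the needle is U+00C4 U+0085, which does not occur in the text.
as_text = letter.replace("\xC4\x85", "&#261;")

# Byte search: the needle is the byte pair C4 85, which does occur.
as_bytes = message.replace(b"\xC4\x85", b"&#261;")

print(as_text == letter, as_bytes)   # True b'&#261;'
```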

Regards,
Stefan

18 years ago by Tomas Kuliavas

Disclaimer: I don't know much about the way unicode is implemented in
php, i have only used it a bit, but i believe i can clear some things
up here.

0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
utf-8. ą

<?php
var_dump("ą" == "\xC4\x85");
echo "ą\n";
echo "\xC4\x85";
?>

If script is written in utf-8, I expect bool(true) on var_dump() line.

You expect wrong things. "\xC4\x85" is a unicode string containing two
codepoints, those at 0xC4 and 0x85 (LATIN CAPITAL LETTER A WITH
DIAERESIS and NEXT LINE (NEL)), while "ą" is a unicode string
containing one code point (0x0105, LATIN SMALL LETTER A WITH OGONEK)
(see
http://www.unicode.org/charts/PDF/U0080.pdf and
http://www.unicode.org/charts/PDF/U0100.pdf). Different strings, so
comparison should return false. If you want to type bytes, use the
"b" prefix: b"\xC4\x85", and compare that with the binary version of
your string literal. var_dump(b"ą" == b"\xC4\x85"); should give you
bool(true) if your encoding is utf-8.

Latin capital letter A with diaeresis is 00C4. Not C4.

I wrote two 8bit values. Not two 16bit ones. Interpreter tries to outsmart
me and thinks that I want 00C4, when I write C4.

http://www.php.net/language.types.string

\x[0-9A-Fa-f]{1,2} - the sequence of characters matching the regular
expression is a character in hexadecimal notation

One or two alphanumerics after x. This escape is used to write 8bit
values. You can't write 16 bit Unicode characters with one escape.

And again you are suggesting me unportable solution.
Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING in
test5.php on line 2

I don't want to maintain different script version for PHP6
unicode.semantics=on.

It is bool(false), when unicode.semantics are turned on. Internal
SquirrelMail character set decoding functions write mapping tables in
hexadecimals or octals. In some cases they evaluate only byte value and
not whole symbol. Multibyte character set decoding can use recode, iconv
and mbstring, but most of single byte decoding is written in plain string
functions and stores hex to html mapping tables in associative arrays.

<?php
// example uses utf-8. similar code is used in iso-8859-2 -
// iso-8859-16 decoding. utf-8 decoding does not need mapping tables
// and is written in pcre.
$s1 = "ą";
$s2 = "\xC4\x85";
echo str_replace($s2,'&#261;',$s1);
?>

Expected result: &#261;
Got: ą

Same thing. If you want binary replacements, use binary strings, not
unicode strings.

mbstring.func_overload and unicode.semantics decisions must be made by
script writers and not by end users. That's why I asked for PHP_INI_ALL
level controls.

I'll wait for better documentation on unicode.*_encoding options and will
see what I can do with them.

--
Tomas

18 years ago by Stefan Walk

Latin capital letter A with diaeresis is 00C4. Not C4.

Pay attention in maths, leading zeroes don't change a number.

I wrote two 8bit values. Not two 16bit ones. Interpreter tries to outsmart
me and thinks that I want 00C4, when I write C4.

No, you didn't do anything with bits. "" is a unicode string, in
unicode strings you are handling codeunits, not bytes. And codeunit
0xC4 is the same as the codeunit 0x00C4 because it's the same number,
and it's the codeunit pointing to a capital A with diaeresis.

http://www.php.net/language.types.string

\x[0-9A-Fa-f]{1,2} - the sequence of characters matching the regular
expression is a character in hexadecimal notation

One or two alphanumerics after x. This escape is used to write 8bit
values. You can't write 16 bit Unicode characters with one escape.

You are quoting php5 documentation, you can't expect the documentation
to reflect code that isn't even alpha. What you quote is true in php6
for binary strings (b prefix) when you read "character" in the C
sense. (When you read "character" as "codeunit" it's true for php6 too,
but you shouldn't use the word "character" that much, as a
"character" is a pretty misleading concept - do you mean a codeunit, a
codepoint, a grapheme, a glyph?) And you CAN write a codeunit in one
escape, like "\u0105". Note that codeunit != codepoint,
var_dump(strlen("\uD801\uDC00")); gives int(1) because there are
surrogates involved.
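
The code unit versus code point distinction can be checked with any UTF-16 encoder. The sketch below assumes Python 3 purely as an illustration tool; it says nothing about how PHP 6 stores strings beyond what is stated above.

```python
# One code point outside the Basic Multilingual Plane: U+10400.
ch = "\U00010400"

# Encode as big-endian UTF-16 (no BOM) and split into 16-bit code units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

code_points = len(ch)      # Python 3 counts code points
code_units = len(units)    # number of UTF-16 code units

print(code_points, code_units, [hex(u) for u in units])   # 1 2 ['0xd801', '0xdc00']
```

The surrogate pair D801 DC00 from the example above is therefore two code units but a single code point.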

And again you are suggesting me unportable solution.
Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING in
test5.php on line 2

Tough luck. Unicode is a major change, no major change without
breakage. It could be made more compatible by using 'u' for marking
unicode strings and no prefix for binary strings, but most of the time
you want to handle text, not binary data, so that would be an
additional burden for the developer. If you definitely want to keep
supporting old versions, i'd suggest you use different files for
different versions and conditionally include them. Nightmare to
maintain, but that's another thing...

I don't want to maintain different script version for PHP6
unicode.semantics=on.

Well, /I/ don't want to see progress hindered by backwards compatibility.

I'll wait for better documentation on unicode.*_encoding options and will
see what I can do with them.

Well, no encoding option will make "ą" == "\xC4\x85"...
To see how unicode string handling works, you can have a look at
python. It's pretty similar...

Regards,
Stefan

18 years ago by Tomas Kuliavas

Latin capital letter A with diaeresis is 00C4. Not C4.

Pay attention in maths, leading zeroes don't change a number.

they do, if it is not a number.

'00C4' + '0085' = '00C40085'

'C4' + '85' = 'C485'

'00C40085' != 'C485'
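
Both readings can be put side by side: the numeric value of a code point does not depend on leading zeroes, while the byte sequence that stores it depends on the encoding. A small sketch, assuming Python 3 as the illustration language and three encodings picked only for contrast:

```python
# As numbers, the two spellings are identical.
as_number_equal = (0xC4 == 0x00C4)

# As stored bytes, the same two code points look different per encoding.
two_code_points = "\u00C4\u0085"
utf16_hex = two_code_points.encode("utf-16-be").hex()   # 16-bit units
latin1_hex = two_code_points.encode("latin-1").hex()    # one byte per code point
utf8_hex = two_code_points.encode("utf-8").hex()        # variable length

print(as_number_equal, utf16_hex, latin1_hex, utf8_hex)
# True 00c40085 c485 c384c285
```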

--
Tomas

18 years ago by Andrei Zmievski

0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
utf-8. ą

<?php
var_dump("ą" == "\xC4\x85");
echo "ą\n";
echo "\xC4\x85";
?>

If script is written in utf-8, I expect bool(true) on var_dump() line.

var_dump("ą" == b"\xC4\x85");

This will give you what you want, if the script is written in UTF-8
and your runtime encoding is set to UTF-8.

<?php
// example uses utf-8. similar code is used in iso-8859-2 -
// iso-8859-16 decoding. utf-8 decoding does not need mapping tables
// and is written in pcre.
$s1 = "ą";
$s2 = "\xC4\x85";
echo str_replace($s2,'&#261;',$s1);
?>

Expected result: &#261;
Got: ą

test setup (php6.0-200705190630) uses trimmed php.ini with only
unicode.semantics=on setting

unicode.fallback_encoding - no value
unicode.filesystem_encoding - no value
unicode.http_input_encoding - no value
unicode.output_encoding - no value
unicode.runtime_encoding - no value
unicode.script_encoding - no value
unicode.semantics - On
unicode.stream_encoding - UTF-8

Why didn't you set any encoding settings?

-Andrei

18 years ago by Tomas Kuliavas

0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
utf-8. ą

<?php
var_dump("ą" == "\xC4\x85");
echo "ą\n";
echo "\xC4\x85";
?>

If script is written in utf-8, I expect bool(true) on var_dump() line.

var_dump("ą" == b"\xC4\x85");

This will give you what you want, if the script is written in UTF-8
and your runtime encoding is set to UTF-8.

<?php
// example uses utf-8. similar code is used in iso-8859-2 -
// iso-8859-16 decoding. utf-8 decoding does not need mapping tables
// and is written in pcre.
$s1 = "ą";
$s2 = "\xC4\x85";
echo str_replace($s2,'&#261;',$s1);
?>

Expected result: &#261;
Got: ą

test setup (php6.0-200705190630) uses trimmed php.ini with only
unicode.semantics=on setting

unicode.fallback_encoding - no value
unicode.filesystem_encoding - no value
unicode.http_input_encoding - no value
unicode.output_encoding - no value
unicode.runtime_encoding - no value
unicode.script_encoding - no value
unicode.semantics - On
unicode.stream_encoding - UTF-8

Why didn't you set any encoding settings?

They are not documented and I am testing configurations that might break
scripts. If I test things and want to make code portable, configuration is
not supposed to be rational. I can set option with ini_set(), if I
understand what option does and it fixes the issue.

http://www.php.net/unicode

Do you have updated documentation version which explains encoding settings
and lists available configuration values? Or am I testing PHP6 too early
and you are still months or years away from 6.0.0 betas and rcs? Could you
implement pseudo encoding similar to 'pass' encoding used in mbstring?
Current implementation does not give controls needed by script writers.

SquirrelMail scripts are not written in unicode. They are in ascii. If
some 8bit value is used, it is always written in octal or hex notation.
These hex values are not written in one character set. In some cases
scripts use byte values. For example, locating first utf-8 byte or looking
for 0x80-0xFF bytes in string. In other cases they are written in source
or target character set. For example, iso-8859-2 decoding function
contains array with iso-8859-2 hex values mapped to html codes. Code can't
use raw 8bit strings, because they might be corrupted in misconfigured
editor used by developer and it is very hard to track such corruption.
8bit data can come only from user input (composed emails and preferences,
html forms, one common charset) and imap server (received emails, lots of
different charsets and encodings).
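
The kind of byte-level code described above can be sketched as follows. This is a hypothetical, heavily trimmed illustration in Python 3, not SquirrelMail's actual code: the table name, the function names, the two table entries and the `?` fallback are all assumptions made for the example.

```python
# Hypothetical two-entry mapping table: iso-8859-2 byte value -> HTML entity.
ISO_8859_2_TO_HTML = {
    0xA1: "&#260;",   # LATIN CAPITAL LETTER A WITH OGONEK
    0xB1: "&#261;",   # LATIN SMALL LETTER A WITH OGONEK
}


def has_8bit(data: bytes) -> bool:
    """Return True if any byte falls in the 0x80-0xFF range."""
    return any(b >= 0x80 for b in data)


def decode_to_html(data: bytes) -> str:
    """Keep ASCII bytes as-is and map 8-bit bytes through the table."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            # Bytes missing from this trimmed table become '?'.
            out.append(ISO_8859_2_TO_HTML.get(b, "?"))
    return "".join(out)


sample = b"m\xB1ka"   # iso-8859-2 bytes, written with a hex escape
print(has_8bit(sample), decode_to_html(sample))   # True m&#261;ka
```

Code of this shape only works if the input and the table keys are treated as raw bytes, which is the behaviour being asked for.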

--
Tomas

18 years ago by Andrei Zmievski

They are not documented and I am testing configurations that might break
scripts. If I test things and want to make code portable, configuration is
not supposed to be rational. I can set option with ini_set(), if I
understand what option does and it fixes the issue.

http://www.php.net/unicode

Do you have updated documentation version which explains encoding settings
and lists available configuration values? Or am I testing PHP6 too early
and you are still months or years away from 6.0.0 betas and rcs? Could you
implement pseudo encoding similar to 'pass' encoding used in mbstring?
Current implementation does not give controls needed by script writers.

Have you looked at any of the talks I've given on this topic?

http://www.gravitonic.com/talks

That's the closest thing to documentation you'll find right now.
Unfortunately, documentation always lags behind the actual development.

SquirrelMail scripts are not written in unicode. They are in ascii. If
some 8bit value is used, it is always written in octal or hex notation.
These hex values are not written in one character set. In some cases
scripts use byte values. For example, locating first utf-8 byte or looking
for 0x80-0xFF bytes in string. In other cases they are written in source
or target character set. For example, iso-8859-2 decoding function
contains array with iso-8859-2 hex values mapped to html codes. Code can't
use raw 8bit strings, because they might be corrupted in misconfigured
editor used by developer and it is very hard to track such corruption.
8bit data can come only from user input (composed emails and preferences,
html forms, one common charset) and imap server (received emails, lots of
different charsets and encodings).

Maybe you don't need to turn unicode.semantics=on, if you are working
only with 8-bit strings.

-Andrei

18 years ago by Andrei Zmievski

This is by design. If you prefer to work with actual bytes, use
binary strings or literals. In unicode strings \xC4 is actually a
codepoint (UTF-16 codepoint) specifying character U+00C4.

-Andrei

strlen("\xC4\x85") = 2. strlen((binary)"\xC4\x85") = 4. Not good. It is
one character in utf-8.
