Done. See latest commit: switch is removed and mode is defaulted to On.
-Andrei
Hi all,
Is there any chance of getting the unicode.semantics issue sorted soon?
Andrei Zmievski escribió:
Done. See latest commit: switch is removed and mode is defaulted to On.
Thank you , VERY MUCH Andrei, now we can really move forward..
--
"Progress is possible only if we train ourselves to think about programs
without thinking of them as pieces of executable code.” - Edsger W.
Dijkstra
Cristian Rodríguez R.
Platform/OpenSUSE - Core Services
SUSE LINUX Products GmbH
Research & Development
http://www.opensuse.org/
Andrei Zmievski escribió:
Done. See latest commit: switch is removed and mode is defaulted to
On.Thank you , VERY MUCH Andrei, now we can really move forward..
well the next question is binary or unicode as the default for
strings ..
regards,
Lukas
Yes, but we should run some sort of test suite and see what exactly breaks
with unicode strings as the default ones.
On Mon, May 19, 2008 at 2:42 PM, Lukas Kahwe Smith mls@pooteeweet.org
wrote:
Andrei Zmievski escribió:
Done. See latest commit: switch is removed and mode is defaulted to On.
Thank you , VERY MUCH Andrei, now we can really move forward..
well the next question is binary or unicode as the default for strings ..
regards,
Lukas
The idea here was to find volunteers who'd try and run their apps and some standard apps on this so we can figure out what the migration path looks like. That'd also give us the necessary data point to make a decision on default for strings. I think none of us right now truly knows what to expect.
Andi
-----Original Message-----
From: Andrei Zmievski [mailto:andrei@gravitonic.com]
Sent: Monday, May 19, 2008 2:54 PM
To: Lukas Kahwe Smith
Cc: Cristian Rodríguez; Steph Fox; internals
Subject: Re: [PHP-DEV] Unicode progress [Was: unicode.semantics ad infinitum]
Yes, but we should run some sort of test suite and see what exactly breaks
with unicode strings as the default ones.
On Mon, May 19, 2008 at 2:42 PM, Lukas Kahwe Smith mls@pooteeweet.org
wrote:
Andrei Zmievski escribió:
Done. See latest commit: switch is removed and mode is defaulted to On.
Thank you , VERY MUCH Andrei, now we can really move forward..
well the next question is binary or unicode as the default for strings ..
regards,
Lukas
Hi Andi, et. al,
Am Montag, den 19.05.2008, 23:03 -0700 schrieb Andi Gutmans:
The idea here was to find volunteers who'd try and run their apps and
some standard apps on this so we can figure out what the migration
path looks like. That'd also give us the necessary data point to make
a decision on default for strings. I think none of us right now truly
knows what to expect.
This might be the time, where we need to think about are more
substantial solution for this problem. What I have in mind is to provide
a system, with the current development versions of PHP (today is would
be 5_2, 5_3 and HEAD) installed. We would liberally grant Shell-Accounts
and DBs to those who are willing to test their web applications there.
Say the S9Y-developers get an account to test their blog system, they
get three Vhosts: s9y.php6.func.php.net, s9y.php53.func.php.net,
s9y.php52.func.php.net. We would continously build CVS snapshots on this
systems to make sure they test against the latest version.
Projects with more resources could run a selenium grid against the
server for their functional tests, the rest would simply test the
applications by hand. We could even install a cruisecontrol (resp.
phpUnderControl) instance to run projects unit tests if the have some.
cu, Lars
Andi Gutmans wrote:
The idea here was to find volunteers who'd try and run their apps and
some standard apps on this so we can figure out what the migration
path looks like. That'd also give us the necessary data point to make
a decision on default for strings. I think none of us right now truly
knows what to expect.
As I a PHP developer on Windows, I pay very close attention to
development work on PHP and regularly test my code against PHP 5.3,
hand-built from the CVS. Nearly all of the PECL extensions are broken on
Windows for PHP 6. The http extension has a very nasty problem: its
fatal error appears to loop nearly ad infinitum. This makes compiling
snaps locally very difficult--I've given up. Why is the log file 4.4 MB
long?
Also, as stands it appears most non-trivial applications are fairly
broken. Those issues need to be resolved first, before we toss Unicode
onto the pile.
--
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier http://htmlpurifier.org Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
Edward Z. Yang wrote:
Also, as stands it appears most non-trivial applications are fairly
broken. Those issues need to be resolved first, before we toss Unicode
onto the pile.
I retract this statement; I didn't realize we had already turned on
Unicode semantics.
--
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier http://htmlpurifier.org Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
Hi Edward,
As I a PHP developer on Windows, I pay very close attention to
development work on PHP and regularly test my code against PHP 5.3,
hand-built from the CVS. Nearly all of the PECL extensions are broken on
Windows for PHP 6. The http extension has a very nasty problem: its
fatal error appears to loop nearly ad infinitum. This makes compiling
snaps locally very difficult--I've given up. Why is the log file 4.4 MB
long?
Because nobody's been trying to compile snaps against PHP 6? Basically most
extension developers decided to stay with 5_3 for now and haven't migrated
their code to work with Unicode on. I would hope, though, that if PHP 6 is
Unicode-only there might be more enthusiasm for adapting code to work with
it.
Also, as stands it appears most non-trivial applications are fairly
broken. Those issues need to be resolved first, before we toss Unicode
onto the pile.
I think the test script and ZE problems should be resolved first... plus I
just discovered that my php.ini isn't being found when I run commandline
tests (e.g. php -r "var_dump(strcasecmp());" doesn't throw an error unless I
explicitly set error_reporting from the commandline, despite having
E_ALL|E_STRICT in the INI). These are basic things, nobody can really go
much further until they're fixed.
So I agree with you - it's far too soon for apps testing at this stage.
- Steph
I think the test script and ZE problems should be resolved first...
And to do that we need to drop the Unicode Switch =)
plus I just discovered that my php.ini isn't being found when I run commandline
tests (e.g. php -r "var_dump(strcasecmp());" doesn't throw an error unless I
explicitly set error_reporting from the commandline, despite having
E_ALL|E_STRICT in the INI). These are basic things, nobody can really go
much further until they're fixed.
Your php.ini is being found & read allright, the problem is that the INI scanner looks
for string constants, while all constants are registered as unicode constants in Unicode mode.
So it fails to find E_ALL
constant and ends up with atoi("E_ALL"), which is 0.
I committed a fix for this one a minute ago.
The broken function names in error messages are easy to fix too, as soon as we
drop the Switch - functions in Unicode mode are registered using their Unicode names,
but then a weird thing happens:
zend_register_standard_ini_entries() makes Unicode copies of function/class/constant tables
(ignoring the fact that they are already Unicode ones), this is why you see only the first
character in function names.
The patch is as easy as this: http://dev.daylessday.org/diff/fix_unicode_function_names.diff
I didn't commit this one yet as I didn't investigate how it would affect non-Unicode mode
(and whether I should worry about non-Unicode mode at all).
--
Wbr,
Antony Dovgal
Hey Tony,
First off, thanks for caring :)
And to do that we need to drop the Unicode Switch =)
Not fix the streams issue?
I committed a fix for this one a minute ago.
Thanks!
The patch is as easy as this:
http://dev.daylessday.org/diff/fix_unicode_function_names.diff
This is the biggest problem, isn't it - that things like this obviously
hadn't been tested until now. I fixed the same issue in Phar last week - it
probably will need un-fixing there once your patch goes in :\
I didn't commit this one yet as I didn't investigate how it would affect
non-Unicode mode (and whether I should worry about non-Unicode mode at
all).
I would've hoped that PHP 6 would be Unicode throughout. Maybe jumping
around when Andrei removed the physical switch was premature... obviously
the UG(unicode) checks are still in place, whereas their binary alternatives
shouldn't still exist. Then again some things are unclear. What should
happen with pack()
, for example?
- Steph
Hey Tony,
First off, thanks for caring :)
And to do that we need to drop the Unicode Switch =)
Not fix the streams issue?
That, too =)
--
Wbr,
Antony Dovgal
Hey Tony,
First off, thanks for caring :)
And to do that we need to drop the Unicode Switch =)
Not fix the streams issue?
Felipe has taken care of that issue, so now we have only one uber-major issue left.
--
Wbr,
Antony Dovgal
Not fix the streams issue?
Felipe has taken care of that issue, so now we have only one uber-major
issue left.
Your patch? Or the changed error message?
FWIW the --EXPECT-- sections on those tests are different in 5_3 and would
pass, so maybe they should just match those?
- Steph
Not fix the streams issue?
Felipe has taken care of that issue, so now we have only one uber-major
issue left.Your patch? Or the changed error message?
Yes, I mean that problem with storing string/Unicode function names.
That patch seems to be no good, run-tests.php doesn't work after I apply it
(some functions are not found).
--
Wbr,
Antony Dovgal
Yes, I mean that problem with storing string/Unicode function names.
That patch seems to be no good, run-tests.php doesn't work after I apply
it (some functions are not found).
<shrug /> it'll be a lot easier to see what's going on if/when the
UG(unicode) if'n'butting goes. There's obviously a conversion too many
somewhere down the line @ present.
- Steph
Yes, I mean that problem with storing string/Unicode function names.
That patch seems to be no good, run-tests.php doesn't work after I apply
it (some functions are not found).<shrug /> it'll be a lot easier to see what's going on if/when the
UG(unicode) if'n'butting goes. There's obviously a conversion too many
somewhere down the line @ present.
Right.
That's exactly what I meant.
--
Wbr,
Antony Dovgal
Not fix the streams issue?
Felipe has taken care of that issue, so now we have only one uber-major
issue left.Your patch? Or the changed error message?
Yes, I mean that problem with storing string/Unicode function names.
That patch seems to be no good, run-tests.php doesn't work after I apply it
(some functions are not found).
This patch fixes it:
http://dev.daylessday.org/diff/shell_exec_unicode.diff
Still, I see no point to fix something until we remove the switch.
--
Wbr,
Antony Dovgal
This patch fixes it:
http://dev.daylessday.org/diff/shell_exec_unicode.diff
Coolth :)
Still, I see no point to fix something until we remove the switch.
Let's give Scott a chance to get home from php|tek and present his patch to
do just that.
- Steph (loin-girder extraordinaire)
Nothing premature about it. It was time.
UG(unicode) checks are still secondary I think - they don't prevent us
from doing tests and moving forward, although cleaning them up would be
nice.
pack()
should take binary strings only, methinks.
-Andrei
Steph Fox wrote:
I would've hoped that PHP 6 would be Unicode throughout. Maybe jumping
around when Andrei removed the physical switch was premature...
obviously the UG(unicode) checks are still in place, whereas their
binary alternatives shouldn't still exist. Then again some things are
unclear. What should happen withpack()
, for example?
- Steph
Hey Andrei,
UG(unicode) checks are still secondary I think - they don't prevent us
from doing tests and moving forward, although cleaning them up would be
nice.
Cleaning them up would make it possible to find and fix the bugs we already
know are there ;) There don't seem to be too many, but when it comes to
tracking down conversions... bleh...
pack()
should take binary strings only, methinks.
It does - 'UG(ascii_conv)' - but sometimes there's a unicode-to-binary
conversion warning thrown by that, and then you have to cast the variable to
a binary empty string to suppress the warning before you assign the pack()
result to it. (See ext/phar/tests/phar_test.inc.) There are similar issues
with unpack()
, and with fwrite()
/fread() and friends when dealing with a
purely binary string.
What concerns me at this point is that we could and should be telling PHP
users how to future-proof their code during the move from PHP 4 to PHP 5. If
we can get a fix on that now and get word out, we'll have made their future
migration path much smoother.
- Steph
Steph Fox wrote:
What concerns me at this point is that we could and should be telling
PHP users how to future-proof their code during the move from PHP 4 to
PHP 5. If we can get a fix on that now and get word out, we'll have made
their future migration path much smoother.
As far as I can tell, it's effectively impossible for anyone who wants
to support "low" PHP 5 versions. That's because the b"" prefix wasn't
introduced until PHP 5.2.1, and that prefix is crucial to
future-proofing any code that deals with binary data.
--
Edward Z. Yang GnuPG: 0x869C48DA
HTML Purifier http://htmlpurifier.org Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
UG(ascii_conv) is for the format string, not the arguments. The warning
is thrown by this code:
convert_to_string_ex(argv[currentarg]);
It should check the arg type and return with a warning if it's Unicode.
-Andrei
Steph Fox wrote:
It does - 'UG(ascii_conv)' - but sometimes there's a unicode-to-binary
conversion warning thrown by that, and then you have to cast the
variable to a binary empty string to suppress the warning before you
assign thepack()
result to it. (See ext/phar/tests/phar_test.inc.)
There are similar issues withunpack()
, and withfwrite()
/fread() and
friends when dealing with a purely binary string.What concerns me at this point is that we could and should be telling
PHP users how to future-proof their code during the move from PHP 4 to
PHP 5. If we can get a fix on that now and get word out, we'll have made
their future migration path much smoother.
- Steph
Hi Andrei,
I see where you're coming from over the (binary)|(string) issue, just trying
to find some way that would preserve a little more back compatibility.
UG(ascii_conv) is for the format string, not the arguments. The warning is
thrown by this code:convert_to_string_ex(argv[currentarg]);
It should check the arg type and return with a warning if it's Unicode.
OK, so there are some things that should maybe 'just happen' here, and also
when 'S' is used:
PHP_NAMED_FUNCTION(php_if_crc32)
{
char *p;
int len, nr;
php_uint32 crcinit = 0;
register php_uint32 crc;
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "S", &p, &nr) ==
FAILURE) {
return;
}
...
Warning: crc32()
expects parameter 1 to be strictly a binary string, Unicode
string given in ...
Surely if a function's expecting a binary string it should do a silent
conversion, and only throw a warning if the conversion fails? I don't see
why the onus should always be on the user to adapt to this.
- Steph
Hi,
Warning:
crc32()
expects parameter 1 to be strictly a binary string, Unicode
string given in ...Surely if a function's expecting a binary string it should do a silent
conversion, and only throw a warning if the conversion fails? I don't see
why the onus should always be on the user to adapt to this.
For some functions taking binary strings is critical for working nicely
with an automatic conversion in this case
crc32(u"äöü")
and
crc32(b"äöü")
would give completely different results depending on the runtime
encoding, relying on a implicit conversion there is most likely a bug
(at least for apps written with PHP 6 in mind).
Oh and I might probably also argue that
crc32(u"äöü")
should give the crc32 of the internal representation (utf-16...) of the
string, which is a total wtf for the user then.
The correct solution is to make safe use of the "S" modifier and not
using it too much.
As binary casts are allowed in modern PHP versions I don't see this as
an issue, while such a cast isn't absolutely the best thing to do: I'd
go with unicode_encode() to be sure about the encoding being used,
everything else is prone to fail some time (some code changing
unicode.runtime_encoding for some random reason...)
johannes
Heya Johannes,
For some functions taking binary strings is critical for working nicely
with an automatic conversion in this case
crc32(u"äöü")
and
crc32(b"äöü")
would give completely different results depending on the runtime
encoding,
Yes - but why should the user have to do the casting? Why can't the function
itself cast to binary when it has an 'S' modifier? Like, during
zend_parse_parameters() for example? Whatever happened to keeping PHP
simple?
relying on a implicit conversion there is most likely a bug
(at least for apps written with PHP 6 in mind).
Oh and I might probably also argue that
crc32(u"äöü")
should give the crc32 of the internal representation (utf-16...) of the
string, which is a total wtf for the user then.
Nobody's asking to be able to cast it to unicode. I'm asking whether it's
entirely necessary to force users to cast to binary all over the place, and
a strict binary parameter spec looks like being one place where the cast
could be done internally.
The correct solution is to make safe use of the "S" modifier and not
using it too much.As binary casts are allowed in modern PHP versions I don't see this as
an issue, while such a cast isn't absolutely the best thing to do: I'd
go with unicode_encode() to be sure about the encoding being used,
everything else is prone to fail some time (some code changing
unicode.runtime_encoding for some random reason...)
You're telling me an explicit cast to binary could fail internally but not
externally? That doesn't make a lot of sense somehow.
- Steph
Steph,
Heya Johannes,
For some functions taking binary strings is critical for working nicely
with an automatic conversion in this case
crc32(u"äöü")
and
crc32(b"äöü")
would give completely different results depending on the runtime
encoding,Yes - but why should the user have to do the casting? Why can't the function
itself cast to binary when it has an 'S' modifier? Like, during
zend_parse_parameters() for example? Whatever happened to keeping PHP
simple?
Since only the users knows the correct encoding. The function might
fallback to unicode.runtime_encoding which can be wrong. And it's hard
to track the reason.
relying on a implicit conversion there is most likely a bug
(at least for apps written with PHP 6 in mind).
Oh and I might probably also argue that
crc32(u"äöü")
should give the crc32 of the internal representation (utf-16...) of the
string, which is a total wtf for the user then.Nobody's asking to be able to cast it to unicode. I'm asking whether it's
entirely necessary to force users to cast to binary all over the place, and
a strict binary parameter spec looks like being one place where the cast
could be done internally.
In this case there's no cast but the most simple implementation of
crc32()
on a unicode string.
The correct solution is to make safe use of the "S" modifier and not
using it too much.As binary casts are allowed in modern PHP versions I don't see this as
an issue, while such a cast isn't absolutely the best thing to do: I'd
go with unicode_encode() to be sure about the encoding being used,
everything else is prone to fail some time (some code changing
unicode.runtime_encoding for some random reason...)You're telling me an explicit cast to binary could fail internally but not
externally? That doesn't make a lot of sense somehow.
Externally the user is responsible to select the proper encoding
internally PHP has to guess.
johannes
Johannes,
You're telling me an explicit cast to binary could fail internally but
not
externally? That doesn't make a lot of sense somehow.Externally the user is responsible to select the proper encoding
internally PHP has to guess.
case 's':
case 'S':
{
char **p = va_arg(*va, char **);
int *pl = va_arg(*va, int *);
UConverter conv = NULL;
<snip />
switch (Z_TYPE_PP(arg)) {
<snip />
case IS_UNICODE:
/ handle conversion of Unicode to binary with a specific converter /
if (conv != NULL) { / this is an 's' specifier */
SEPARATE_ZVAL_IF_NOT_REF(arg);
if (convert_to_string_with_converter(*arg, conv) == FAILURE) {
return "";
}
*p = Z_STRVAL_PP(arg);
pl = Z_STRLEN_PP(arg);
break;
} else if (c == 'S' && Z_TYPE_PP(arg) != IS_NULL / NULL
is ok /) {
return "strictly a binary string";
}
/ fall through */
I'll try to explain why this isn't useful. First off, you get anomalies like
this:
C:\sandbox\php6\Debug_TS>php -r "echo crc32('');"
Warning: crc32()
expects parameter 1 to be strictly a binary string, Unicode
string given in Command line code on line 1
C:\sandbox\php6\Debug_TS>php -r "echo crc32(null);"
0
Second, you don't always get the same value anyway if the encoding changes.
Test script:
echo crc32((binary)'שלום')."\n";
echo crc32((binary)'AKUO');
with the script saved in UTF-8 and
unicode.fallback_encoding=UTF-8
unicode.runtime_encoding=UTF-8
unicode.stream_encoding=UTF-8
output is:
-1600612531
1603041141
with the same script saved in ISO-8859-8 and
unicode.fallback_encoding=ISO-8859-8
unicode.runtime_encoding=ISO-8859-8
unicode.stream_encoding=ISO-8859-8
output is:
-2023737703
1603041141
These are exactly the same results I see under PHP 5, depending whether the
script is saved in ISO-8859-8 or UTF-8.
Now if I remove the (binary) cast and alter the relevant section of
zend_parse_arg_impl():
} else if (c == 'S' && Z_TYPE_PP(arg) != IS_NULL /* `NULL` is ok */) {
if (zval_unicode_to_string(*(arg) TSRMLS_CC) == FAILURE) {
return "strictly a binary string";
}
}
/* fall through */
I get exactly the same results again, with or without the binary cast. All
that changes is I don't get an error when I skip the casting.
If the script encoding doesn't match up with whatever's set in INI, I don't
get as far as that stuff anyway:
Warning: Illegal or truncated character in input: offset 0, state=0 in
C:\sandbox\php-src\Debug_TS\help.php on line 5
Parse error: parse error, expecting `')'' in
C:\sandbox\php-src\Debug_TS\help.php on line 5
- regardless of whether I cast to binary or not, and regardless of whether
I've messed with the src.
- Steph
johannes
Steph Fox wrote:
OK, so there are some things that should maybe 'just happen' here, and
also when 'S' is used:PHP_NAMED_FUNCTION(php_if_crc32)
{
char *p;
int len, nr;
php_uint32 crcinit = 0;
register php_uint32 crc;if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "S", &p, &nr) ==
FAILURE) {
return;
}
...Warning:
crc32()
expects parameter 1 to be strictly a binary string,
Unicode string given in ...Surely if a function's expecting a binary string it should do a silent
conversion, and only throw a warning if the conversion fails? I don't
see why the onus should always be on the user to adapt to this.
No, it shouldn't. Because crc32()
and other HMACs depend on the actual
bytes, there shouldn't be any automatic conversion. The user needs to
pass in the binary string (bytes).
-Andrei
Yes, but we should run some sort of test suite and see what exactly breaks
with unicode strings as the default ones.
Right, but in order to do this we need to add filtered streams support to stream_select()
.
At the moment all tests in HEAD fail because of this:
Warning: stream_select()
: cannot cast a filtered stream on this system in /local/qa/head/run-tests.php on line 938
Warning: stream_select()
: No stream arrays were passed in /local/qa/head/run-tests.php on line 938
--
Wbr,
Antony Dovgal
At the moment all tests in HEAD fail because of this:
Warning:stream_select()
: cannot cast a filtered stream on this system in
/local/qa/head/run-tests.php on line 938
Warning:stream_select()
: No stream arrays were passed in
/local/qa/head/run-tests.php on line 938
Why not just use the 5_3 run-tests.php, which works?
- Steph
At the moment all tests in HEAD fail because of this:
Warning:stream_select()
: cannot cast a filtered stream on this system in
/local/qa/head/run-tests.php on line 938
Warning:stream_select()
: No stream arrays were passed in
/local/qa/head/run-tests.php on line 938Why not just use the 5_3 run-tests.php, which works?
run-tests.php in 5_3 uses stream_select()
, too.
And stream_select()
in HEAD doesn't work because Unicode is now enabled by default.
--
Wbr,
Antony Dovgal
run-tests.php in 5_3 uses
stream_select()
, too.
Andstream_select()
in HEAD doesn't work because Unicode is now enabled by
default.
Bizarre. I ran 400-odd tests with run-tests from 5_3 and never saw that
error...
- Steph
--
Wbr, Antony Dovgal
run-tests.php in 5_3 uses
stream_select()
, too.
Andstream_select()
in HEAD doesn't work because Unicode is now enabled
by default.Bizarre. I ran 400-odd tests with run-tests from 5_3 and never saw that
error...
Ah, I have it. My TEST_PHP_EXECUTABLE isn't the same as the one running the
script - it's 5.3.0-dev. So make test will fail, but my manual tests don't.
I'm seeing 85% pass rate in Zend/tests, investigating.
- Steph