Before we go breaking things, please read this document that
describes how PHP will support the Unicode standard natively.
Hopefully the attachment will work.
Thanks,
-Andrei
This looks very promising, I'm impressed by the work you guys have done (big
thumbs up).
There are a few issues/questions I have after reading your document:
"Therefore, command such as 'print' and 'echo' automatically convert their
arguments to the specified encoding. No automatic output encoding is
performed for anything else."
What about the other functions that output to stdout directly, such as
readfile() and passthru()?
"The conversion failure behavior can be customized"...
Maybe it would be a nice feature to have a U_INVALID_EXCEPTION, so that
users can actually catch the error and deal with it. Just an idea. Of course
it's not usual for the PHP core and extensions to throw exceptions, but
perhaps this could change with PHP6.
"In order to create binary string literals, a new syntax is necessary:
prefixing a string literal with letter 'b' creates a binary string."
The b-prefix for binary strings is great, but how does that work with a
function like file_get_contents() or fread()?
One can't do: $data = bfile_get_contents("somefile.bin");
And even if one could (somehow), wouldn't file_get_contents() already
unicode-encode all data it reads? How does such a function know if the user
is expecting binary or textual data or does the encoding simply happen after
the string is returned? In that case it's up to the user to use the
b-prefix, but then there's the syntax problem I mentioned.
Keep up the good work,
Ron
On Wed, 10 Aug 2005 12:45:27 +0200
"Ron Korving" r.korving@xit.nl wrote:
This looks very promising, I'm impressed by the work you guys have done (big
thumbs up).
There are a few issues/questions I have after reading your document:
"Therefore, command such as 'print' and 'echo' automatically convert their
arguments to the specified encoding. No automatic output encoding is
performed for anything else."
That's actually something I wanted to ask about too.
Do we really need such kind of magic?
I think it may be pretty confusing when after echo'ing or print'ing a variable
you can see one output, but after writing the very same variable into a file
you can see something completely different.
IMO it's similar to what we have with __toString() ATM.
Yes, it's documented, but it's still confusing that there is some magic
involved in one case and there is no magic in another, almost similar case.
--
Wbr,
Antony Dovgal
Do we really need such kind of magic?
I think it may be pretty confusing when after echo'ing or print'ing a variable
you can see one output, but after writing the very same variable into a file
you can see something completely different.
Absolutely, we do need it. Consider that the internal encoding is
UTF-16 and outputting that directly to a terminal (or browser) is bound
to cause havoc. That's just one of the examples.
-Andrei
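
A minimal sketch of why raw UTF-16 would be a problem on output, written against PHP as it exists today rather than the proposed Unicode build. It assumes the mbstring extension is loaded and only compares the bytes of the same two-character text in UTF-8 and UTF-16:

```php
<?php
// "h" followed by "é", spelled as explicit UTF-8 bytes so the example
// does not depend on the encoding of this source file.
$text = "h\xC3\xA9";

// Assumption: mbstring is available. Convert to big-endian UTF-16, the
// kind of representation the thread says would be used internally.
$utf16 = mb_convert_encoding($text, 'UTF-16BE', 'UTF-8');

echo bin2hex($text), "\n";   // 68c3a9
echo bin2hex($utf16), "\n";  // 006800e9
```

Every ASCII character gains a leading zero byte in UTF-16, which a terminal or browser expecting UTF-8 or Latin-1 would not render as intended.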
"In order to create binary string literals, a new syntax is necessary:
prefixing a string literal with letter 'b' creates a binary string."
The b-prefix for binary strings is great, but how does that work with a
function like file_get_contents() or fread()?
One can't do: $data = bfile_get_contents("somefile.bin");
fopen() and file_get_contents() already understand a context parameter;
specifying whether you want binary or string/unicode data can be done
through that.
And the b syntax only works for literal strings in your code:
b"foo" works, but b$foo is not going to work.
Derick
--
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org
Derick Rethans wrote:
"In order to create binary string literals, a new syntax is necessary:
prefixing a string literal with letter 'b' creates a binary string."
The b-prefix for binary strings is great, but how does that work with a
function like file_get_contents() or fread()?
One can't do: $data = bfile_get_contents("somefile.bin");
fopen() and file_get_contents() already understand a context parameter;
specifying whether you want binary or string/unicode data can be done
through that.
We create images in PHP scripts and pass them through with
readfile("foo.gif"). Did I understand correctly that this would still
work without changes? But echo file_get_contents("foo.gif") would fail,
right?
This is not a complaint, just trying to understand the implications,
- Chris
This looks very promising, I'm impressed by the work you guys have done (big
thumbs up).
Thanks.
What about the other functions that output to stdout directly, such as
readfile() and passthru()?
readfile() uses streams so it would rely on stream filters and such.
passthru() should probably operate in binary mode.
Maybe it would be a nice feature to have a U_INVALID_EXCEPTION, so that
users can actually catch the error and deal with it. Just an idea. Of course
it's not usual for the PHP core and extensions to throw exceptions, but
perhaps this could change with PHP6.
I think the feature of raising exceptions vs. errors is orthogonal to
what the switch does. Consider that you may want the
skip/substitute/escape performed and then raise an error or not.
The b-prefix for binary strings is great, but how does that work with a
function like file_get_contents() or fread()?
One can't do: $data = bfile_get_contents("somefile.bin");
And even if one could (somehow), wouldn't file_get_contents() already
unicode-encode all data it reads? How does such a function know if the user
is expecting binary or textual data or does the encoding simply happen after
the string is returned? In that case it's up to the user to use the
b-prefix, but then there's the syntax problem I mentioned.
The 'b' prefix is only for string literals. file_get_contents(), fread()
and other streams-based functions use the default stream semantics,
meaning that unless you change the default context, the data returned
by them will be of IS_BINARY type. The default context can contain a
filter that decodes the data from the specified encoding into Unicode.
-Andrei
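
The idea of a stream filter that decodes data while it is being read can be sketched with the streams API that PHP already ships. This is not the proposed default-context mechanism; it assumes the iconv extension (which provides the convert.iconv.* filters) and uses a temporary file and a Latin-1 sample chosen purely for illustration:

```php
<?php
// Write a small ISO-8859-1 file: "caf" followed by byte 0xE9 (é in Latin-1).
$path = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($path, "caf\xE9");

// Default stream semantics: the bytes come back exactly as stored.
$raw = file_get_contents($path);

// With a read filter attached, the bytes are converted as they are read.
// Assumption: the iconv extension is loaded.
$fp = fopen($path, 'rb');
stream_filter_append($fp, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);
$decoded = stream_get_contents($fp);
fclose($fp);
unlink($path);

echo bin2hex($raw), "\n";      // raw bytes of the file
echo bin2hex($decoded), "\n";  // same text after the filter re-encoded it
```

The caller decides per stream whether it wants raw bytes or decoded text, which is the distinction the b prefix cannot express for function results.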
Andrei Zmievski wrote:
+ Determining length of Unicode strings via `strlen()` function, some simple string functions ported (substr).
It's not a problem to determine the kind of a character in single-byte
character sets, but in Unicode, with its various encoding schemes, I don't
see an easy way to do it.
It would be nice to have functions like this: isNumber(char),
isAlphabetic(char), isWhitespace(char) ...
Is this planned or not?
--
Ondrej Ivanic
(ondrej@kmit.sk)
It would be nice to have functions like this: isNumber(char),
isAlphabetic(char), isWhitespace(char) ...
Is this planned or not?
It's done already, just not committed yet...
clayton
"Ondrej Ivanic" ondrej@kmit.sk wrote in message
news:4301A0D6.6000205@kmit.sk...
Andrei Zmievski wrote:
Ondrej Ivanic
(ondrej@kmit.sk)
cshmoove@bellsouth.net wrote:
It would be nice to have functions like this: isNumber(char),
isAlphabetic(char), isWhitespace(char) ...
Is this planned or not?
It's done already, just not committed yet...
clayton
Please don't use stupid caps, these are functions not methods.
Andrey
Please don't use stupid caps, these are functions not methods.
Andrey
Of course not. See
http://icu.sourceforge.net/apiref/icu4c/uchar_8h.html
but note that the functions conform to PHP's function naming conventions
(lower_case_words_separated_by_underscores()).
clayton
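
For readers who want a feel for what such classification functions might look like under PHP's naming convention, here is a rough sketch. The function names are hypothetical (they are not the uncommitted functions mentioned above), and the implementation swaps in PCRE Unicode properties instead of ICU's uchar.h, so the categories only approximate ICU's definitions. Input is assumed to be a single UTF-8 encoded character:

```php
<?php
// Hypothetical helpers; names follow lower_case_words_separated_by_underscores.
// Each expects exactly one UTF-8 encoded character.

function char_is_number(string $ch): bool
{
    // \p{N}: any Unicode number category (decimal digits, letter numbers, ...).
    return preg_match('/\A\p{N}\z/u', $ch) === 1;
}

function char_is_alphabetic(string $ch): bool
{
    // \p{L}: any Unicode letter; narrower than ICU's Alphabetic property.
    return preg_match('/\A\p{L}\z/u', $ch) === 1;
}

function char_is_whitespace(string $ch): bool
{
    // \p{Z} covers Unicode separators; \s adds tab, newline and friends.
    return preg_match('/\A[\p{Z}\s]\z/u', $ch) === 1;
}

var_dump(char_is_number("\xD9\xA3"));      // U+0663 ARABIC-INDIC DIGIT THREE
var_dump(char_is_alphabetic("\xC3\xA9"));  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
var_dump(char_is_whitespace("\xC2\xA0"));  // U+00A0 NO-BREAK SPACE
```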
On Wed, 10 Aug 2005 00:31:30 -0700, in php.internals
andrei@gravitonic.com (Andrei Zmievski) wrote:
- existing PHP escape sequences are also interpreted as Unicode codepoints,
including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020
[..]
The single-quoted string is more restrictive than the other two types: so
far the only escape sequence allowed inside of it was \', which specifies
a literal single quote. However, single quoted strings now support the new
Unicode character escape sequences as well.
For what it's worth, would \1 be interpreted as well in single quotes
(as it currently is in double quotes)?
I suppose one of the places where \digit would be present in several
cases is in poorly written pregs, such as:
print preg_replace('/([A-Z])/','<b>\1</b>',$string);
(where \1 is used as a backreference instead of \\1 or $1)
I'm not that worried about my own preg-usage. I just want to be
prepared if I ever have to review some code for the purpose of
migrating to PHP6.
--
- Peter Brodersen
For what it's worth, would \1 be interpreted as well in single quotes
(as it currently is in double quotes)?
No. Only \u and \U have meaning in single quotes (in addition to
current ones).
-Andrei
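
The difference Peter is asking about can be seen in current PHP, independent of the Unicode proposal. A small sketch, with a sample input string chosen for illustration:

```php
<?php
$string = 'aBc';

// Single quotes: \1 is not an escape sequence, so the backslash and digit
// reach preg_replace() untouched and act as a backreference.
$single = preg_replace('/([A-Z])/', '<b>\1</b>', $string);

// Double quotes: "\1" is an octal escape, so the replacement text contains
// the single byte 0x01 and no backreference at all.
$double = preg_replace('/([A-Z])/', "<b>\1</b>", $string);

echo $single, "\n";           // a<b>B</b>c
echo bin2hex($double), "\n";  // 613c623e013c2f623e63
```

Since only \u and \U gain meaning inside single quotes, the single-quoted pattern above keeps working as a backreference.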