PHP Unicode support design document

20 years ago by Makoto Tozawa

This looks good. When the unicode_semantics switch is turned on, it
provides a Unicode-everywhere development solution for PHP developers.
The output, input, and script encodings are all UTF-8 by default.
The internal encoding is UTF-16, which allows developers to handle
surrogate pairs correctly.
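
To make the surrogate-pair point concrete, here is a minimal sketch using the mbstring extension that ships with current PHP. It is only an illustration of how UTF-16 represents a character outside the Basic Multilingual Plane, not the internal representation described in the design document. The choice of U+1F600 as the sample character is an assumption made for this example.

```php
<?php
// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 must
// encode it as a surrogate pair (two 16-bit code units).
$s = "\u{1F600}";

// Convert the UTF-8 string to big-endian UTF-16 (no byte order mark).
$utf16 = mb_convert_encoding($s, 'UTF-16BE', 'UTF-8');

$codeUnits  = intdiv(strlen($utf16), 2);   // number of 16-bit code units
$codePoints = mb_strlen($s, 'UTF-8');      // number of code points

// Unpack the byte string into its big-endian 16-bit units.
$units = array_values(unpack('n*', $utf16));

printf("%d code units, %d code point\n", $codeUnits, $codePoints);
printf("%04X %04X\n", $units[0], $units[1]);
```

Code that counts 16-bit units instead of code points would see this one character as two, which is why surrogate handling matters for any UTF-16 based design.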

I have one issue and one question.

"HTTP Input Encoding
...
If the HTTP request contains the encoding specification in the headers,
then it will be used instead of this setting."

To the best of my knowledge, there is no HTTP request header that
specifies the encoding of the request. If the intent is to honor
Accept-Charset, it may cause a problem, because browsers don't
guarantee that the encoding in Accept-Charset is the same as the
encoding used to escape characters in the URL query string. After all,
Accept-Charset specifies the character encodings acceptable for
the response.
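
A minimal sketch of why the guess matters, using functions from current PHP and its mbstring extension: the same character can be percent-escaped as different bytes, and decoding with the wrong assumed encoding either fails or silently produces garbage. The helper name decode_query_value() and the sample query strings are hypothetical, introduced only for this illustration.

```php
<?php
// The same word "café", percent-escaped under two different encodings.
$latin1Query = 'name=caf%E9';      // é escaped as ISO-8859-1
$utf8Query   = 'name=caf%C3%A9';   // é escaped as UTF-8

// Hypothetical helper: decode the value of a one-parameter query string
// under an assumed encoding and return it as UTF-8, or null if the raw
// bytes are not valid in that encoding.
function decode_query_value(string $query, string $assumedEncoding): ?string
{
    $raw   = substr($query, strpos($query, '=') + 1);
    $bytes = rawurldecode($raw);
    if (!mb_check_encoding($bytes, $assumedEncoding)) {
        return null;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $assumedEncoding);
}

// Correct guesses recover the original text.
var_dump(decode_query_value($latin1Query, 'ISO-8859-1'));
var_dump(decode_query_value($utf8Query, 'UTF-8'));

// Wrong guesses: the first is rejected as invalid UTF-8; the second is
// accepted but yields mojibake, because every byte is valid ISO-8859-1.
var_dump(decode_query_value($latin1Query, 'UTF-8'));
var_dump(decode_query_value($utf8Query, 'ISO-8859-1'));
```

The last case is the dangerous one: nothing signals an error, so a wrong guess based on Accept-Charset would go unnoticed.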

"Upgrading Existing Functions"

It seems that all the existing functions need to be upgraded to work
properly when the unicode_semantics switch is turned on, because it
changes the semantics of fundamental functions. I'm assuming the
existing functions won't work properly if fundamental functions such
as strlen() behave differently.

Is there any way to keep byte semantics (as opposed to Unicode
semantics) only for the existing functions? For example, the Oracle 8
functions can be configured to use UTF-8 for the character encoding of
strings. For them to work properly, the fundamental functions that the
Oracle 8 functions call have to behave with byte semantics. And if
they work properly when the unicode_semantics switch is turned on,
then by setting runtime_encoding to UTF-8 they can be called by
Unicode applications.

Makoto

20 years ago by Andrei Zmievski

Hi,

"HTTP Input Encoding
...
If the HTTP request contains the encoding specification in the headers,
then it will be used instead of this setting."

To the best of my knowledge, there is no HTTP request header that
specifies the encoding of the request. If the intent is to honor
Accept-Charset, it may cause a problem, because browsers don't
guarantee that the encoding in Accept-Charset is the same as the
encoding used to escape characters in the URL query string. After all,
Accept-Charset specifies the character encodings acceptable for
the response.

I took a closer look at this today and RFC 2616 does not specify
whether user agents are supposed to send a charset parameter in the
Content-Type header of the POST request. I did not see any of my
browsers doing so. I think we can safely disregard this and rely on
http_input_encoding and output_encoding settings. We are not going to
use Accept-Charset for the reasons you mention.

Is there any way to keep byte semantics (as opposed to Unicode
semantics) only for the existing functions? For example, the Oracle 8
functions can be configured to use UTF-8 for the character encoding of
strings. For them to work properly, the fundamental functions that the
Oracle 8 functions call have to behave with byte semantics. And if
they work properly when the unicode_semantics switch is turned on,
then by setting runtime_encoding to UTF-8 they can be called by
Unicode applications.

I couldn't parse this on the first try. Could you restate this?

-Andrei

20 years ago by Makoto Tozawa

Andrei Zmievski wrote:

Is there any way to keep byte semantics (as opposed to Unicode
semantics) only for the existing functions? For example, the Oracle 8
functions can be configured to use UTF-8 for the character encoding of
strings. For them to work properly, the fundamental functions that the
Oracle 8 functions call have to behave with byte semantics. And if
they work properly when the unicode_semantics switch is turned on,
then by setting runtime_encoding to UTF-8 they can be called by
Unicode applications.

I couldn't parse this on the first try. Could you restate this?

Say there is a function that calls strlen($s) expecting it to return
the byte size of $s, and it works fine when $s contains multibyte
characters. For example, the function expects strlen('áéí') to return
6 when the encoding is UTF-8. If this function is called by
Unicode-ready applications on Unicode-enabled PHP, it will fail,
because strlen('áéí') will return 3.

Is there any way to let strlen('áéí') return 6 only when it is called
by the existing function?
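
The two counts in question can be reproduced with current PHP and its mbstring extension, which serves here only as a stand-in. This sketch does not show how the proposed unicode_semantics switch would behave; it only shows the byte count that the existing function relies on next to the character count that a Unicode-aware strlen() would report.

```php
<?php
// "áéí" written with escapes so the bytes do not depend on how this
// source file happens to be saved; the result is a UTF-8 byte string.
$s = "\u{E1}\u{E9}\u{ED}";

$bytes      = strlen($s);              // byte semantics: 2 bytes per character
$chars      = mb_strlen($s, 'UTF-8');  // character semantics: code points
$bytesAgain = mb_strlen($s, '8bit');   // byte count requested explicitly

printf("bytes=%d chars=%d bytes(8bit)=%d\n", $bytes, $chars, $bytesAgain);
// prints "bytes=6 chars=3 bytes(8bit)=6"
```

An existing function that sizes a buffer or a database column from the first number would break if the same call started returning the second.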

I hope I explained it better this time.

Makoto

20 years ago by Adam Maccabee Trachtenberg

I took a closer look at this today and RFC 2616 does not specify
whether user agents are supposed to send a charset parameter in the
Content-Type header of the POST request. I did not see any of my
browsers doing so. I think we can safely disregard this and rely on
http_input_encoding and output_encoding settings. We are not going to
use Accept-Charset for the reasons you mention.

I don't know if this is useful, but Sam Ruby did a bunch of digging
into HTTP/HTML/XML encodings and precedence rules. See slides 72-75 of
this presentation. (Note that the link below starts at slide 72.)

http://intertwingly.net/slides/2005/etcon/72.html

If possible, I would prefer not to assume that we only need to follow
the behavior of popular user agents.

Someone could be using PHP as a web service server and have people
write client scripts submitting all kinds of POST data that needs to
be processed correctly. In an ideal world (heh), PHP would handle all
of those scripts as long as they followed the specifications.
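
As a rough sketch of what honoring such clients could look like: prefer a charset parameter on the request's Content-Type header when one is present, and fall back to the configured default otherwise. The function name pick_input_encoding() is hypothetical, the regular expression is a simplification of real header parsing, and nothing here is taken from the design document; the default argument merely stands in for the http_input_encoding setting mentioned above.

```php
<?php
// Hypothetical helper: choose the encoding for decoding a request body.
// $contentType is the raw Content-Type header value, or null if absent.
// $default stands in for the configured http_input_encoding setting.
function pick_input_encoding(?string $contentType, string $default): string
{
    // Look for a "charset=..." parameter, optionally quoted.
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// A client that declares its encoding is taken at its word.
echo pick_input_encoding('application/x-www-form-urlencoded; charset=ISO-8859-1', 'UTF-8'), "\n";

// A typical browser POST carries no charset, so the default applies.
echo pick_input_encoding('application/x-www-form-urlencoded', 'UTF-8'), "\n";
```

This keeps the browser case unchanged while letting a spec-following web service client declare something other than the default.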

-adam

--
adam@trachtenberg.com | http://www.trachtenberg.com
author of o'reilly's "upgrading to php 5" and "php cookbook"
avoid the holiday rush, buy your copies today!