Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:18332
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Message-ID: <430BDBAC.70701@oracle.com>
Date: Tue, 23 Aug 2005 19:30:04 -0700
User-Agent: Mozilla Thunderbird 1.0.2 (Windows/20050317)
MIME-Version: 1.0
To: PHP Developers Mailing List <internals@lists.php.net>
CC: christopher.jones@oracle.com
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: PHP Unicode support design document
From: makoto.tozawa@oracle.com (Makoto Tozawa)

This looks good. When the unicode_semantics switch is turned on, it
gives PHP developers a Unicode-everywhere development solution.
The output, input, and script encodings are all UTF-8 by default.
The internal encoding is UTF-16, which allows developers to handle
surrogate pairs correctly.
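
To illustrate the surrogate-pair point, here is a small sketch in Python
(not the proposed PHP API; the helper names are my own). A character
outside the BMP takes two UTF-16 code units, so anything that counts or
slices by code unit has to be aware of pairs.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


bmp = utf16_code_units("\u3042")       # HIRAGANA LETTER A: one code unit
supp = utf16_code_units("\U0001D11E")  # MUSICAL SYMBOL G CLEF: two code units

print([hex(u) for u in bmp])   # ['0x3042']
print([hex(u) for u in supp])  # ['0xd834', '0xdd1e']
```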

I have one issue and one question.

"HTTP Input Encoding
 ...
If the HTTP request contains the encoding specification in the headers,
then it will be used instead of this setting."

To the best of my knowledge, there is no HTTP request header that
specifies the encoding of the request. If the intent is to honor
Accept-Charset, it may cause a problem, because browsers don't
guarantee that the encoding listed in Accept-Charset is the same as
the encoding used to escape characters in the URL query string. After
all, Accept-Charset specifies the character encodings acceptable for
the response.
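
A rough sketch of why this matters, in Python rather than PHP (the
helper escape_query_value is hypothetical and only stands in for what a
browser does when it escapes a query string). The same character is
escaped to different bytes depending on the encoding used, and nothing
in the escaped form says which encoding that was.

```python
from urllib.parse import quote, unquote


def escape_query_value(text, page_encoding):
    """Percent-encode text using the bytes of page_encoding (hypothetical helper)."""
    return quote(text, safe="", encoding=page_encoding)


e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

as_utf8 = escape_query_value(e_acute, "utf-8")         # "%C3%A9"
as_latin1 = escape_query_value(e_acute, "iso-8859-1")  # "%E9"

# Decoding with the wrong guess does not fail; it silently yields other text.
wrong = unquote(as_utf8, encoding="iso-8859-1")
right = unquote(as_utf8, encoding="utf-8")

print(as_utf8, as_latin1)
print(wrong == e_acute, right == e_acute)  # False True
```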


"Upgrading Existing Functions"

It seems that all the existing functions need to be upgraded to work
properly when the unicode_semantics switch is turned on, because it
changes the semantics of fundamental functions. I'm assuming the
existing functions won't work properly if fundamental functions such
as strlen() behave differently.

Is there any way to keep byte semantics (as opposed to Unicode
semantics) only for the existing functions? For example, the Oracle 8
functions can be configured to use UTF-8 as the character encoding of
strings. For them to work properly, the fundamental functions that the
Oracle 8 functions call have to behave with byte semantics. And if
they work properly when the unicode_semantics switch is turned on,
then by setting the runtime_encoding to UTF-8 they can be called by
Unicode applications.
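
To make the concern concrete, here is a sketch in Python (not PHP, and
not how the Oracle 8 functions are actually written; fits_in_buffer and
the 8-byte limit are assumptions for illustration). A byte-oriented
extension that sizes a buffer from a length function gets the wrong
answer if that function starts counting code points instead of bytes.

```python
def byte_length(text, encoding="utf-8"):
    """Length under byte semantics: bytes in the given encoding."""
    return len(text.encode(encoding))


def code_point_length(text):
    """Length under Unicode semantics: number of code points."""
    return len(text)


def fits_in_buffer(text, buffer_bytes, length_fn):
    """Hypothetical check an extension might do before copying UTF-8 bytes."""
    return length_fn(text) <= buffer_bytes


word = "\u65e5\u672c\u8a9e"  # three CJK characters, three bytes each in UTF-8

print(byte_length(word))        # 9
print(code_point_length(word))  # 3

# With an 8-byte buffer, only the byte count gives the correct answer.
print(fits_in_buffer(word, 8, byte_length))        # False
print(fits_in_buffer(word, 8, code_point_length))  # True, but 9 bytes would be copied
```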


Makoto