From: makoto.tozawa@oracle.com (Makoto Tozawa)
Date: Tue, 23 Aug 2005 19:30:04 -0700
Newsgroups: php.internals
To: PHP Developers Mailing List
CC: christopher.jones@oracle.com
Subject: Re: PHP Unicode support design document
Message-ID: <430BDBAC.70701@oracle.com>

This looks good. When the unicode_semantics switch is turned on, it provides a Unicode-everywhere development solution to PHP developers.
The output, input, and script encodings are all UTF-8 by default. The internal encoding is UTF-16, which allows developers to handle surrogate pairs correctly.

I have one issue and one question.

"HTTP Input Encoding ... If the HTTP request contains the encoding specification in the headers, then it will be used instead of this setting."

To the best of my knowledge, there is no HTTP request header that specifies the encoding of the request. If the intent is to honor Accept-Charset, that may cause a problem, because browsers don't guarantee that the encoding in Accept-Charset is the same as the encoding used to escape characters in the URL query string. After all, Accept-Charset specifies the character encodings acceptable for the response.

"Upgrading Existing Functions"

It seems that all the existing functions need to be upgraded to work properly when the unicode_semantics switch is turned on, because it changes the semantics of fundamental functions. I'm assuming that the existing functions won't work properly if fundamental functions such as strlen() behave differently.

Is there any way to keep byte semantics (as opposed to Unicode semantics) only for the existing functions? For example, the Oracle 8 functions can be configured to use UTF-8 as the character encoding of strings. For them to work properly, the fundamental functions that the Oracle 8 functions call have to behave with byte semantics. And if they work properly when the unicode_semantics switch is turned on, then by setting runtime_encoding to utf-8 they can be called by Unicode applications.

Makoto
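
P.S. To make the strlen() concern concrete, here is a minimal sketch of how the "length" of one string differs under byte semantics and Unicode semantics. It is written in Python rather than PHP, purely as an illustration; the helper names are hypothetical and are not part of the design document or of any PHP API.

```python
# Illustrative only: Python stands in for PHP, and these helper names are
# hypothetical. The point is that a single string has three different
# "lengths" depending on which semantics a strlen()-like function uses.

def utf8_byte_length(s: str) -> int:
    """Length under byte semantics, assuming the string is stored as UTF-8."""
    return len(s.encode("utf-8"))


def utf16_code_unit_length(s: str) -> int:
    """Length in UTF-16 code units (a surrogate pair counts as two)."""
    return len(s.encode("utf-16-le")) // 2


def code_point_length(s: str) -> int:
    """Length in Unicode code points."""
    return len(s)


# 'a' (U+0061), 'e' with acute accent (U+00E9), and U+10400, which lies
# outside the Basic Multilingual Plane and needs a surrogate pair in UTF-16.
sample = "a\u00e9\U00010400"

print(utf8_byte_length(sample))        # 1 + 2 + 4 bytes
print(utf16_code_unit_length(sample))  # 1 + 1 + 2 code units
print(code_point_length(sample))       # 3 code points
```

An extension that sizes a buffer from the byte count would presumably get the wrong answer if the underlying length function started returning one of the other two values, which is the situation I am asking about for the Oracle 8 functions.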