Newsgroups: php.internals
Date: Sun, 26 Aug 2007 21:49:52 +0900
From: rui_hirokawa@ybb.ne.jp (Rui Hirokawa)
To: LAUPRETRE François (P)
Cc: "PHP Internals List", "Gregory Beaver", "Marcus Boerger"
Message-ID: <20070826214046.4495.RUI_HIROKAWA@ybb.ne.jp>
Subject: Re: [PHP-DEV] [PATCH] zend-multibyte unicode detection vs. __halt_compiler()

Hi,

IMHO, #42396 is not a bug; it is the specified behavior.
A normal script doesn't contain a null byte unless it is encoded in Unicode.

Adding detection of a unique byte sequence '0xFFFFFFFF' to support PHAR/PHK
is understandable, but it is a change that adds a new feature.

Rui

On Thu, 23 Aug 2007 18:58:52 +0200
LAUPRETRE François (P) wrote:

> Hi,
>
> Here is a patch I am submitting to fix bug #42396 (PHP 5).
>
> The problem: when PHP is configured with the '--enable-zend-multibyte' option, it tries to autodetect unicode-encoded scripts. Then, if a script contains null bytes after an __halt_compiler() directive, it will be considered as UTF-16 or 32, and the execution typically results in a lot of '?' garbage. In practice, it makes PHK and PHAR incompatible with the zend-multibyte feature.
>
> The only workaround was to turn off the (undocumented) 'detect_unicode' flag. But it is not a real solution, as people may want to use unicode detection along with PHK/PHAR packages, and there's no logical reason to keep them incompatible.
>
> The patch I am submitting assumes that a document encoded in UTF-8, UTF-16, or UTF-32 cannot contain a sequence of four 0xff bytes. So, it adds a small detection loop before scanning the script for null bytes.
> If a sequence of 4 0xff is found, the unicode detection is aborted and the script is considered as non unicode, whatever other binary data it can contain. Of course, this detection happens after looking for a byte-order mark.
>
> Now, I can modify the PHK_Creator tool to set 4 0xff bytes after the __halt_compiler() directive, which makes the generated PHK archives compatible with zend-multibyte. The same for PHAR.
>
> It would be better if we could scan the script for null bytes only up to the __halt_compiler() directive, but I suspect it to be impossible as it is not yet compiled...
>
> Regards
>
> Francois
>
> --- zend_multibyte.c.old	2007-01-01 10:35:46.000000000 +0100
> +++ zend_multibyte.c	2007-08-23 17:22:24.000000000 +0200
> @@ -1035,6 +1035,7 @@
>  	zend_encoding *script_encoding = NULL;
>  	int bom_size;
>  	char *script;
> +	unsigned char *p,*p_end;
>  
>  	if (LANG_SCNG(script_org_size) < sizeof(BOM_UTF32_LE)-1) {
>  		return NULL;
> @@ -1069,6 +1070,18 @@
>  		return script_encoding;
>  	}
>  
> +	/* Search for four 0xff bytes - if found, script cannot be unicode */
> +
> +	p=(unsigned char *)LANG_SCNG(script_org);
> +	p_end=(p+LANG_SCNG(script_org_size)-3);
> +	while (p < p_end) {
> +		if (   ((* p)   ==(unsigned char)0x0ff)
> +			&& ((*(p+1))==(unsigned char)0x0ff)
> +			&& ((*(p+2))==(unsigned char)0x0ff)
> +			&& ((*(p+3))==(unsigned char)0x0ff)) return NULL;
> +		p++;
> +	}
> +
>  	/* script contains NULL bytes -> auto-detection */
>  	if (memchr(LANG_SCNG(script_org), 0, LANG_SCNG(script_org_size))) {
>  		/* make best effort if BOM is missing */
>

-- 
Rui Hirokawa
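
For readers who want to try the marker check from the quoted patch outside the engine, here is a minimal standalone sketch in C. The function name has_ff_marker and its signature are hypothetical and do not exist in zend_multibyte.c; the sketch only mirrors the idea of the patch (report whether four consecutive 0xff bytes occur anywhere in the buffer) and uses a run counter instead of the patch's four-byte comparison at each position.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of PHP: returns 1 if buf contains four
 * consecutive 0xff bytes, 0 otherwise. Embedded null bytes are treated
 * like any other byte, so the buffer may hold arbitrary binary data.
 */
static int has_ff_marker(const unsigned char *buf, size_t len)
{
    size_t run = 0; /* length of the current run of 0xff bytes */

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        run = (buf[i] == 0xff) ? run + 1 : 0;
        if (run == 4) {
            return 1;
        }
    }
    return 0;
}
```

In the patch, the equivalent check runs after the byte-order-mark test and before the memchr() scan for null bytes, so a positive result skips auto-detection entirely. Counting the run length avoids reading past the end of short buffers without needing the separate size check that the patched function already performs earlier.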