Newsgroups: php.internals
Date: Thu, 23 Aug 2007 18:58:52 +0200
From: francois.laupretre@ratp.fr (LAUPRETRE François (P))
To: "PHP Internals List"
Cc: "Gregory Beaver", "Marcus Boerger"
Subject: [PATCH] zend-multibyte unicode detection vs. __halt_compiler()

Hi,

Here is a patch I am submitting to fix bug #42396 (PHP 5).

The problem: when PHP is configured with the '--enable-zend-multibyte' option, it tries to autodetect unicode-encoded scripts. If a script contains null bytes after a __halt_compiler() directive, it is then treated as UTF-16 or UTF-32, and execution typically produces a lot of '?' garbage. In practice, this makes PHK and PHAR incompatible with the zend-multibyte feature.

So far, the only workaround is to turn off the (undocumented) 'detect_unicode' flag. That is not a real solution, as people may want to use unicode detection along with PHK/PHAR packages, and there is no logical reason to keep them incompatible.

The patch assumes that a document encoded in UTF-8, UTF-16, or UTF-32 cannot contain a sequence of four 0xff bytes. It therefore adds a small detection loop that runs before the script is scanned for null bytes. If a sequence of four 0xff bytes is found, unicode detection is aborted and the script is considered non-unicode, whatever other binary data it contains. Of course, this check happens after looking for a byte-order mark.

With this in place, I can modify the PHK_Creator tool to write four 0xff bytes after the __halt_compiler() directive, which makes the generated PHK archives compatible with zend-multibyte. The same can be done for PHAR.

It would be better to scan the script for null bytes only up to the __halt_compiler() directive, but I suspect this is impossible, as the script is not yet compiled at that point...
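To illustrate the order of the tests described above, here is a minimal standalone sketch, written outside of the Zend engine. The function names has_ff_marker() and should_autodetect_unicode() are hypothetical and do not exist in PHP. The sketch assumes the byte-order-mark check has already been done by the caller and only mirrors the two remaining steps: look for four consecutive 0xff bytes first, then look for null bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Zend function): returns 1 if buf contains
 * four consecutive 0xff bytes, 0 otherwise. */
int has_ff_marker(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);

    /* i + 4 <= len keeps buf[i+3] inside the buffer, including when len < 4 */
    for (i = 0; i + 4 <= len; i++) {
        if (buf[i] == 0xff && buf[i + 1] == 0xff &&
            buf[i + 2] == 0xff && buf[i + 3] == 0xff) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: returns 1 if encoding auto-detection should be
 * attempted, i.e. the marker is absent and the buffer holds a null byte.
 * Assumes the caller has already handled any byte-order mark. */
int should_autodetect_unicode(const unsigned char *buf, size_t len)
{
    if (has_ff_marker(buf, len)) {
        return 0; /* marker found: treat the script as non-unicode */
    }
    return len > 0 && memchr(buf, 0, len) != NULL;
}
```

With such a check, an archive whose binary part contains null bytes is still treated as non-unicode as long as the four 0xff bytes appear somewhere in the file, which is what writing them right after __halt_compiler() would guarantee.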
Regards

Francois

--- zend_multibyte.c.old	2007-01-01 10:35:46.000000000 +0100
+++ zend_multibyte.c	2007-08-23 17:22:24.000000000 +0200
@@ -1035,6 +1035,7 @@
 	zend_encoding *script_encoding = NULL;
 	int bom_size;
 	char *script;
+	unsigned char *p,*p_end;
 
 	if (LANG_SCNG(script_org_size) < sizeof(BOM_UTF32_LE)-1) {
 		return NULL;
@@ -1069,6 +1070,18 @@
 		return script_encoding;
 	}
 
+	/* Search for four 0xff bytes - if found, script cannot be unicode */
+
+	p=(unsigned char *)LANG_SCNG(script_org);
+	p_end=(p+LANG_SCNG(script_org_size)-3);
+	while (p < p_end) {
+		if ( ((* p) ==(unsigned char)0x0ff)
+			&& ((*(p+1))==(unsigned char)0x0ff)
+			&& ((*(p+2))==(unsigned char)0x0ff)
+			&& ((*(p+3))==(unsigned char)0x0ff)) return NULL;
+		p++;
+	}
+
 	/* script contains NULL bytes -> auto-detection */
 	if (memchr(LANG_SCNG(script_org), 0, LANG_SCNG(script_org_size))) {
 		/* make best effort if BOM is missing */