Newsgroups: php.internals
Message-ID: <47E5D75D.9010108@zend.com>
Date: Sat, 22 Mar 2008 21:06:53 -0700
To: Marcus Boerger
Cc: internals@lists.php.net, Alan Knowles, Andi Gutmans, Rui Hirokawa, Johannes Schlueter
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
From: stas@zend.com (Stanislav Malyshev)

> is broken code and not a single test. If this is not going to change as in
> we are not getting any .phpt files for this feature then there are two

As I understand it, the theory should be pretty simple: you set the input encoding (by config or declare) and the internal encoding, and when the script is read, it is converted from input to internal. However, it appears that since flex couldn't stomach certain encodings, there is also a hack: the script is translated from the input encoding to some "safe" encoding for flex, and then strings are translated back to the "internal" encoding after flex processes them. If re2c can deal with encodings like SJIS without trouble, then some of the hacks might be unnecessary. I think the encodings that need to be checked are those in zend_multibyte.c that have the "compatible" flag off.

Here's a short script example I found that shows the problem:

<?php echo 'ソ'; ?>

The character echoed there is U+30BD "Katakana letter SO". If you run it in UTF-8, it works fine. However, if you recode it to Shift-JIS, it won't run, since the script looks like this to the parser:

<?php echo '<83>\'; ?>

(That's a dump of vi output, so replace <83> with an actual 0x83 byte if you compose it.) That's a parse error if parsed "naively", since the second byte of the character, 0x5C, reads as a backslash escaping the closing quote. So somehow the parser needs to know that 0x83+\ is actually U+30BD, while the user still might want it as 0x83+\ in a zval (or maybe as UTF-8; it depends on the user).

-- 
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
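
For anyone who wants to check the byte-level collision described above outside PHP, here is a minimal sketch in Python, chosen only because its standard shift_jis codec makes the bytes easy to inspect. The helper naive_single_quoted_end is hypothetical: it is not how flex, re2c, or the Zend scanner work, it only imitates a byte-wise scanner that treats every 0x5C as an escape.

```python
# Sketch: why U+30BD breaks a byte-wise scanner once the source is Shift-JIS.
SO = "\u30bd"  # KATAKANA LETTER SO

utf8_bytes = SO.encode("utf-8")
sjis_bytes = SO.encode("shift_jis")


def naive_single_quoted_end(src: bytes) -> int:
    """Hypothetical scanner: return the index of the quote that closes the
    first single-quoted string in src, or -1 if it never closes.
    Every 0x5C byte is treated as a backslash that escapes the next byte,
    with no knowledge of multibyte characters."""
    i = src.index(b"'") + 1
    while i < len(src):
        if src[i] == 0x5C:   # backslash: skip the escaped byte
            i += 2
            continue
        if src[i] == 0x27:   # closing single quote
            return i
        i += 1
    return -1


script = "<?php echo '\u30bd'; ?>"

print(sjis_bytes.hex())  # 835c: the trail byte is an ASCII backslash
print(naive_single_quoted_end(script.encode("utf-8")) != -1)      # string closes
print(naive_single_quoted_end(script.encode("shift_jis")) != -1)  # string never closes
```

In the Shift-JIS case the scanner consumes the trail byte 0x5C together with the closing quote, so the string literal runs to end of input. A multibyte-aware scanner would have to consume 0x83 0x5C as one unit before looking for escapes.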