Newsgroups: php.internals
Date: Sun, 23 Mar 2008 14:57:29 +0100
Message-ID: <1527921560.20080323145729@marcus-boerger.de>
Reply-To: Marcus Boerger
To: Stanislav Malyshev
CC: Marcus Boerger, internals@lists.php.net, Alan Knowles, Andi Gutmans, Rui Hirokawa, Johannes Schlueter
In-Reply-To: <47E5D75D.9010108@zend.com>
References: <1706278209.20080302232134@marcus-boerger.de> <47CB3CDC.8050006@akbkhome.com> <1789182684.20080303113429@marcus-boerger.de> <47CC16FC.7080802@akbkhome.com> <497326103.20080322152343@marcus-boerger.de> <47E5D75D.9010108@zend.com>
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
From: helly@php.net (Marcus Boerger)

Hello Stanislav,

  cool, care to change the code snippet into a test as I've done for Rui's
snippet?

marcus

Sunday, March 23, 2008, 5:06:53 AM, you wrote:

>> is broken code and not a single test. If this is not going to change as in
>> we are not getting any .phpt files for this feature then there are two

> As I understand the theory of the thing should be pretty simple, you set
> input encoding (by config or declare) and internal encoding, and then
> when script is being read, you convert it from input to internal.

> However, it appears that since flex couldn't stomach certain encodings,
> there's also a hack there - script is translated from input to some
> "safe" encoding for flex, and then strings are translated back to
> "internal" encoding after flex processes them. If re2c can deal with
> encodings like SJIS without trouble then some of the hacks might be
> unnecessary. I think encodings that need to be checked are those in
> zend_multibyte.c that have "compatible" flag off.

> Here's a short script example I found that shows what's the problem there:
>
> <?php echo 'ソ'; ?>
>
> Character echoed there is U+30BD "Katakana letter SO". Now if you run it
> in UTF-8, works good. However, if you recode it to Shift-JIS, it won't
> run, since this script looks to the parser this way:
>
> <?php echo '<83>\'; ?>
>
> (that's dump of VI output, so replace <83> with actual 0x83 if you
> compose it). That's parse error for the parser, if parsed "naively". So
> somehow the parser needs to know 0x83+\ is actually U+30BD and at the
> same time the user still might want it as 0x83+\ in a zval (or maybe as
> utf-8 - it depends on him).
> --
> Stanislav Malyshev, Zend Software Architect
> stas@zend.com   http://www.zend.com/
> (408)253-8829   MSN: stas@zend.com

Best regards,
 Marcus
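
For anyone who wants to reproduce the 0x83+\ problem quoted above without touching the engine, here is a minimal sketch in Python (not PHP, and not the Zend scanner). It assumes a toy byte-oriented scanner that treats every 0x5C as an escape inside a single-quoted literal; the helper names naive_literal and scan_via_safe_encoding are made up for illustration, and using UTF-8 as the "safe" encoding is an assumption rather than what zend_multibyte.c necessarily does.

```python
# U+30BD KATAKANA LETTER SO, the character from the quoted script.
SO = "\u30bd"
SCRIPT = "<?php echo '" + SO + "'; ?>"

SJIS_SCRIPT = SCRIPT.encode("shift_jis")  # second byte of SO is 0x5C ('\')
UTF8_SCRIPT = SCRIPT.encode("utf-8")      # SO becomes E3 82 BD, no 0x5C


def naive_literal(script: bytes):
    """Toy byte-wise scan of the first single-quoted literal.

    Every 0x5C is treated as an escape that swallows the next byte,
    which is what goes wrong for Shift-JIS input.
    Returns the literal's bytes, or None if the quote is never closed.
    """
    start = script.index(b"'") + 1
    i = start
    while i < len(script):
        if script[i] == 0x5C:   # backslash: skip the escaped byte
            i += 2
            continue
        if script[i] == 0x27:   # closing single quote
            return script[start:i]
        i += 1
    return None                 # unterminated literal, i.e. a parse error


def scan_via_safe_encoding(script: bytes, input_enc: str, internal_enc: str) -> bytes:
    """Hypothetical round trip: input -> UTF-8 for scanning -> internal."""
    safe = script.decode(input_enc).encode("utf-8")
    literal = naive_literal(safe)
    if literal is None:
        raise ValueError("unterminated string literal")
    return literal.decode("utf-8").encode(internal_enc)


if __name__ == "__main__":
    print(naive_literal(SJIS_SCRIPT))   # naive scan of the SJIS bytes
    print(naive_literal(UTF8_SCRIPT))   # naive scan of the UTF-8 bytes
    print(scan_via_safe_encoding(SJIS_SCRIPT, "shift_jis", "shift_jis"))
```

The round trip is only meant to show why the literal has to be converted back after scanning if the user expects 0x83+\ in the resulting value; a scanner that understands the input encoding directly would not need the detour.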