Newsgroups: php.internals
From: moriyoshi@at.wakwak.com (Moriyoshi Koizumi)
To: ilia@prohost.org
Cc: PHP Internals
Date: Sat, 13 Dec 2003 05:15:38 +0900
Subject: Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

On 2003/12/13, at 5:09, Ilia Alshanetsky wrote:

> I mentioning this now because we are considering changes to the function in
> the development branch, which is a fine time to resolve any deficiencies.

Okay, fine :)

> The added functionality, which if I understand correctly is support for
> multibyte delimeters and enclosures is great.
> But it hardly explains a

The change was not for multibyte delimiters and enclosures. The current
implementation still allows only single-byte characters for the delimiter
and enclosure. I could have added that capability as well, but I didn't
because it appeared to slow the function down considerably.

The problem has been that several multibyte encodings such as CP932, CP936,
CP949, CP950 and Shift_JIS may use a value in the range 0x40 - 0xfe as the
second byte of a character, so a byte that looks like a single-byte
character may actually be the second half of a multibyte one. Therefore we
need to check whether the octet at a given position belongs to a multibyte
character, and this motivated me to bring a scanner-like finite-state
machine implementation into fgetcsv() (and basename()).

See http://www.microsoft.com/globaldev/reference/WinCP.mspx for details.

> significant performance disparity I am seeing. I believe much of the problem
> can be solved by moving from manual string iteration to one using C library
> functions such as memchr(). When parsing non-multibyte text there shouldn't
> be more then 10-15% performance loss.

> I should mention that benchmarks were made using time utility, so advantages
> offered by PHP 5's speedups were discounted. Had they been considered the
> speed loss would've been 300% or more.

If we limited the support to UTF-8 and EUC encodings only, we could get
drastically better performance. But that wouldn't solve the practical
problems in the places where the function is actually in use.

Moriyoshi
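
For illustration only, the second-byte problem described above can be
sketched in C roughly as follows. This is a minimal sketch and not the code
in the patch: the names is_sjis_lead(), find_delim_naive() and
find_delim_sjis() are hypothetical, and the lead-byte ranges assume
Shift_JIS/CP932 (0x81 - 0x9f and 0xe0 - 0xfc).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if c can start a two-byte character.
 * Assumes Shift_JIS/CP932 lead-byte ranges. */
static int is_sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc);
}

/* Naive search: plain memchr(), which cannot tell a real delimiter
 * from the second byte of a two-byte character. */
static const char *find_delim_naive(const char *buf, size_t len, char delim)
{
    assert(buf != NULL);
    return memchr(buf, (unsigned char)delim, len);
}

/* Encoding-aware search: walks the buffer from the start and skips
 * both bytes of every two-byte character, so only octets that are
 * genuine single-byte characters are compared with delim.
 * Returns a pointer to the delimiter, or NULL if there is none. */
static const char *find_delim_sjis(const char *buf, size_t len, char delim)
{
    size_t i = 0;

    assert(buf != NULL);
    while (i < len) {
        unsigned char c = (unsigned char)buf[i];

        if (is_sjis_lead(c) && i + 1 < len) {
            i += 2;             /* skip lead byte and its second byte */
            continue;
        }
        if (buf[i] == delim) {
            return buf + i;
        }
        i++;
    }
    return NULL;
}
```

With the four octets 0x83 0x7c '|' 'b' and '|' (0x7c) as the delimiter, the
naive search stops at offset 1, inside the two-byte character, while the
encoding-aware one returns offset 2. The price is that every octet has to be
inspected in order, which is the kind of cost being discussed in this thread.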