Newsgroups: php.internals
Date: Wed, 11 Jun 2008 11:01:41 -0700
Message-ID: <819912BDAE6BCB4097883B226DA473B10B0ACB6B@SACEXMV02.hq.netapp.com>
In-Reply-To: <1213187529.21247.9.camel@goldfinger.johannes.nop>
To: Johannes Schlüter
Cc: Scott MacVicar, Nuno Lopes, internals@lists.php.net, Michal Dziemianko
Subject: RE: [PHP-DEV] Algorithm Optimizations - string search
From: Tex.Texin@netapp.com ("Texin, Tex")

Ok, well then the code needs to use internationalized functions for
upper- and lowercasing strings.

Operating on the first character of the string without its surrounding
context is incorrect. Operating on the string without a locale is also
incorrect.

The string operations should use ICU. Also, ICU uses Boyer-Moore, I
believe. (Or it did the last time I looked.)

There are some other issues as well, but I will have to look at the code.
I wasn't thinking of UTF-16, so you might also look at surrogates.

Are there guidelines for PHP coding, and for proper support of UTF-16?

> -----Original Message-----
> From: Johannes Schlüter [mailto:johannes@schlueters.de]
> Sent: Wednesday, June 11, 2008 5:32 AM
> To: Texin, Tex
> Cc: Scott MacVicar; Nuno Lopes; internals@lists.php.net;
> Michal Dziemianko
> Subject: RE: [PHP-DEV] Algorithm Optimizations - string search
>
> Hi,
>
> On Wed, 2008-06-11 at 01:01 -0700, Texin, Tex wrote:
> > When I looked at the code, I assumed that it wasn't intended for
> > international use. I'll have to go back and look to give you
> > details, but it doesn't work for international use or Unicode.
> > It would be ok for 8859-1.
>
> That's the default case in PHP < 6. In current PHP versions
> all string operations work on "binary" strings, so all
> references to offsets are byte-based, not character-based. That's
> one of the main reasons for PHP 6.
>
> johannes
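
To make the points in this exchange concrete (case mapping needs
surrounding context, byte offsets differ from character offsets, and
UTF-16 needs surrogate handling), here is a minimal sketch. It is written
in Python purely for illustration; it is not the PHP code under
discussion and it does not call ICU. The sample strings and the helper
utf16_units are my own, hypothetical choices.

```python
# Illustrative only: Python's built-in Unicode support stands in for ICU.

def utf16_units(text: str) -> int:
    """Return the number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2

# 1. Case mapping depends on context: a capital sigma at the end of a
#    word lowercases to a different letter than a sigma mapped alone.
word = "ΟΔΟΣ"
per_char = "".join(ch.lower() for ch in word)  # each character in isolation
whole = word.lower()                           # whole string, with context

# 2. Case mapping can change the length of a string.
upper_strasse = "straße".upper()               # "STRASSE": 6 chars become 7

# 3. Byte offsets and character offsets diverge for multi-byte text.
name = "Schlüter"
char_offset = name.find("t")                   # 5 (counted in characters)
byte_offset = name.encode("utf-8").find(b"t")  # 6 (counted in UTF-8 bytes)

# 4. Characters outside the BMP take two UTF-16 code units
#    (a surrogate pair), so a unit-wise search must not split them.
clef = "\U0001D11E"                            # MUSICAL SYMBOL G CLEF
clef_units = utf16_units(clef)                 # 2
```

Note that this sketch does not cover the locale point: Python's str
methods are not locale-aware, so a case such as the Turkish dotted and
dotless i would still need a locale-aware library such as ICU.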