From: felipensp@gmail.com (Felipe Pena)
To: Philip Olson
Cc: Rasmus Lerdorf, Jean-Sébastien Hedde, internals
Date: Tue, 6 Nov 2012 19:46:10 -0200
Subject: Re: [PHP-DEV] Incomprehension with preg_match and utf8

Hi guys,

2012/11/6 Philip Olson

> On Nov 5, 2012, at 8:55 AM, Rasmus Lerdorf wrote:
>
> > On 11/05/2012 08:41 AM, Jean-Sébastien Hedde wrote:
> >> On Mon, 05 Nov 2012 08:04:06 -0800, Rasmus Lerdorf
> >> wrote:
> >>>
> >>> I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
> >>> actually becomes \p{Xan} which should match Unicode chars as well, but
> >>> only if PCRE was compiled with Unicode support. So I suspect you don't
> >>> actually have a Unicode-capable PCRE build in some cases there.
> >>>
> >>> -Rasmus
> >>
> >> I will report the bug to the package maintainers (remi, debian too...).
> >>
> >> Is there anyway for us to avoid those "wrong" builds ?
> >
> > I don't see how.
>
> Hi geeks,
>
> Does anyone have a suggestion on how the documentation should be
> updated? The quote is from here:
>
> http://php.net/manual/en/regexp.reference.character-classes.php
>
> With the quote being:
>
> "In UTF-8 mode, characters with values greater than 128 do
> not match any of the POSIX character classes."
>
> A few simple/related facts:
>
> - PCRE_UCP exists as of PCRE 8.10
> - Gustavo mentioned the related PHP change on Oct 3, 2010 (not sure
>   what PHP version, and googling for "87a237342" turns up empty,
>   and I miss SVN version numbers)
>
> Anyway, how should this be documented?
>
> Regards,
> Philip

I added PCRE_UCP in PHP 5.3.4 as a fix for bug #52971. [1]

For the documentation, just say something like:

"In Unicode mode, the Unicode properties are used instead to classify
characters of some classes."

More information, extracted from the PCRE documentation [2]:

---------8<-------------------------------
[:alnum:]  becomes  \p{Xan}
[:alpha:]  becomes  \p{L}
[:blank:]  becomes  \h
[:digit:]  becomes  \p{Nd}
[:lower:]  becomes  \p{Ll}
[:space:]  becomes  \p{Xps}
[:upper:]  becomes  \p{Lu}
[:word:]   becomes  \p{Xwd}

Negated versions, such as [:^alpha:], use \P instead of \p. The other POSIX
classes are unchanged, and match only characters with code points less than
128.
---------------------------------------------

[1] - http://svn.php.net/viewvc/?view=revision&revision=303963
[2] - http://pcre.org/man.txt

-- 
Regards,
Felipe Pena
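
As a rough illustration of the table above, the substitution can be written as a simple lookup. This is only a sketch in Python, not PCRE's actual implementation: the names `UCP_EQUIVALENTS` and `ucp_equivalent` are made up for this example, and the `\H` result for a negated `[:blank:]` is an assumption, since the excerpt only spells out the `\P` case.

```python
import re

# POSIX class name -> what it becomes when PCRE_UCP is set,
# copied from the PCRE documentation excerpt quoted above.
UCP_EQUIVALENTS = {
    "alnum": r"\p{Xan}",
    "alpha": r"\p{L}",
    "blank": r"\h",
    "digit": r"\p{Nd}",
    "lower": r"\p{Ll}",
    "space": r"\p{Xps}",
    "upper": r"\p{Lu}",
    "word": r"\p{Xwd}",
}

# Matches "[:name:]" or the negated form "[:^name:]".
_POSIX_CLASS = re.compile(r"\[:(\^?)([a-z]+):\]")


def ucp_equivalent(posix_class: str) -> str:
    """Return the UCP-mode equivalent of a POSIX class such as '[:alpha:]'.

    Classes that are not in the table (for example '[:punct:]') are
    returned unchanged, mirroring "the other POSIX classes are unchanged".
    """
    match = _POSIX_CLASS.fullmatch(posix_class)
    if match is None:
        raise ValueError(f"not a POSIX class: {posix_class!r}")

    negated = match.group(1) == "^"
    replacement = UCP_EQUIVALENTS.get(match.group(2))
    if replacement is None:
        return posix_class
    if negated:
        # \p{..} becomes \P{..}; \h becoming \H is an assumption (see above).
        replacement = replacement[0] + replacement[1].upper() + replacement[2:]
    return replacement
```

For instance, `ucp_equivalent("[:^alpha:]")` gives `\P{L}`, while `ucp_equivalent("[:punct:]")` comes back as `[:punct:]` because that class keeps its ASCII-only meaning.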