Hi,
I'm facing an issue with preg_match and a UTF-8 string.
The pattern is: /^[[:alnum:]\s-'%]+$/u
The string: Régis
If I read the manual correctly, preg_match should return 0 ("In UTF-8 mode,
characters with values greater than 128 do not match any of the POSIX
character classes.") but I get 1 in some cases:
On a Windows host
php 5.2.12 - (PCRE 7.9 2009-04-11) : preg_match === 1
On the same centos host :
php 5.2.10 (Rémi's RPM) - (PCRE 6.6 06-Feb-2006) : preg_match === 0
php 5.4.8 (my build) - (PCRE 8.12 2011-01-15) : preg_match === 1
On another CentOS host:
php 5.4.0 (Rémi's RPM) - (PCRE 7.8 2008-09-05)
How is this possible?
Regards,
Jean-Sébastien Hedde
au-fil-du.net
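
For anyone who wants to compare hosts, the report above can be reproduced with a short script along these lines. This is only a sketch: the hyphen in the character class is escaped here (\-), which is an assumption made so the pattern also compiles on newer PCRE2-based PHP builds that reject a bare hyphen after \s, and the subject is written as explicit UTF-8 bytes so the result does not depend on the file encoding.

```php
<?php
// Same character class as in the report, with the hyphen escaped (assumption,
// see the note above). The subject is "Régis" spelled out as UTF-8 bytes.
$pattern = '/^[[:alnum:]\s\-\'%]+$/u';
$subject = "R\xC3\xA9gis";

$result = preg_match($pattern, $subject);

// Print the versions next to the result so outputs from different hosts
// can be compared directly.
printf(
    "PHP %s, PCRE %s: preg_match returned %s\n",
    PHP_VERSION,
    PCRE_VERSION,
    var_export($result, true)
);
```

A result of 1 means [[:alnum:]] matched the "é"; 0 means it did not; false means the pattern failed to compile on that build.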
On 2012-11-05 10:57, Jean-Sébastien Hedde wrote:
I'm facing an issue with preg_match and a UTF-8 string.
The pattern is: /^[[:alnum:]\s-'%]+$/u
The string: Régis
If I read the manual correctly, preg_match should return 0 ("In UTF-8 mode,
characters with values greater than 128 do not match any of the POSIX
character classes.") but I get 1 in some cases:
The documentation is simply out of date. We have been setting PCRE_UCP
when the 'u' modifier is present for some time now (since 87a237342,
3 Oct 2010). Look for PCRE_UCP in http://www.pcre.org/pcre.txt to learn
the implications.
--
Gustavo Lopes
I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
actually becomes \p{Xan} which should match Unicode chars as well, but
only if PCRE was compiled with Unicode support. So I suspect you don't
actually have a Unicode-capable PCRE build in some cases there.
-Rasmus
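
One way to check this from PHP itself, rather than guessing from the version string, is to probe the PCRE build at runtime. The sketch below is only an illustration: the two function names are hypothetical helpers (not part of PHP), and the code assumes PHP 7 or later for the return type declarations.

```php
<?php
// Hypothetical helper: true when the PCRE build behind preg_* accepts the
// 'u' modifier and Unicode properties. On a build without that support the
// pattern fails to compile and preg_match() returns false; the @ hides the
// resulting warning.
function pcre_supports_unicode_properties(): bool
{
    return @preg_match('/^\p{L}$/u', "\xC3\xA9") === 1; // "é" as UTF-8 bytes
}

// Hypothetical helper: true when POSIX classes are Unicode-aware in 'u'
// mode, i.e. when [[:alpha:]] matches a non-ASCII letter.
function pcre_posix_classes_are_unicode_aware(): bool
{
    return @preg_match('/^[[:alpha:]]$/u', "\xC3\xA9") === 1;
}

printf("PCRE %s\n", PCRE_VERSION);
printf("Unicode properties:  %s\n", pcre_supports_unicode_properties() ? 'yes' : 'no');
printf("Unicode-aware POSIX: %s\n", pcre_posix_classes_are_unicode_aware() ? 'yes' : 'no');
```

If the first check passes but the second does not, the build handles Unicode but PCRE_UCP is not in effect, which would be consistent with the preg_match === 0 result reported above.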
On Mon, 05 Nov 2012 08:04:06 -0800, Rasmus Lerdorf rasmus@lerdorf.com
wrote:
I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
actually becomes \p{Xan} which should match Unicode chars as well, but
only if PCRE was compiled with Unicode support. So I suspect you don't
actually have a Unicode-capable PCRE build in some cases there.
-Rasmus
I will report the bug to the package maintainers (remi, debian too...).
Is there any way for us to avoid those "wrong" builds?
Regards,
Jean-Sébastien Hedde
au-fil-du.net
On Mon, 05 Nov 2012 08:04:06 -0800, Rasmus Lerdorf rasmus@lerdorf.com
wrote:
I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
actually becomes \p{Xan} which should match Unicode chars as well, but
only if PCRE was compiled with Unicode support. So I suspect you don't
actually have a Unicode-capable PCRE build in some cases there.
-Rasmus
I will report the bug to the package maintainers (remi, debian too...).
Is there any way for us to avoid those "wrong" builds?
I don't see how.
Hi geeks,
Does anyone have a suggestion on how the documentation should be
updated? The quote is from here:
http://php.net/manual/en/regexp.reference.character-classes.php
With the quote being:
"In UTF-8 mode, characters with values greater than 128 do
not match any of the POSIX character classes."
A few simple/related facts:
- PCRE_UCP exists as of PCRE 8.10
- Gustavo mentioned the related PHP change on Oct 3, 2010 (not sure
what PHP version, and googling for "87a237342" turns up empty,
and I miss SVN version numbers)
Anyway, how should this be documented?
Regards,
Philip
[...]
A few simple/related facts:
[...]
- Gustavo mentioned the related PHP change on Oct 3, 2010 (not sure
what PHP version, and googling for "87a237342" turns up empty,
and I miss SVN version numbers)
For reference: php_version.h
(http://git.php.net/?p=php-src.git;a=blob;f=main/php_version.h;h=bfa499ac7f1daa2292c75cdc1398950b43d35174;hb=87a237342282fe036bb90486fdd6cdc392e16ac7)
in commit 87a237342282fe036bb90486fdd6cdc392e16ac7
(http://git.php.net/?p=php-src.git;a=commit;h=87a237342282fe036bb90486fdd6cdc392e16ac7)
lists the version as 5.3.99-dev. The commit adds PCRE_UCP
(http://git.php.net/?p=php-src.git;a=commitdiff;h=87a237342282fe036bb90486fdd6cdc392e16ac7;hp=00f75c79ca9318cbd57590b4c01144369612b3c2)
when defined and the "u" modifier is used. The commit message is:
- Fixed bug #52971 (PCRE-Meta-Characters not working with utf-8)
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by
setting the PCRE_UCP option.
The PHP changelog (http://php.net/ChangeLog-5.php#5.3.4) lists version
5.3.4 as containing the fix for bug #52971
(https://bugs.php.net/bug.php?id=52971).
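
The effect described in that commit message can be seen with a few one-character tests. This is a minimal sketch, assuming PHP 5.3.4 or later linked against PCRE 8.10 or later; the sample characters are my own choice and are written as explicit UTF-8 bytes.

```php
<?php
// With the 'u' modifier (and therefore PCRE_UCP), the escapes use Unicode
// properties instead of the ASCII-only tables.
$withUcp = [
    '\w on U+00E9 (e acute)'        => preg_match('/^\w$/u', "\xC3\xA9"),
    '\d on U+0663 (Arabic-Indic 3)' => preg_match('/^\d$/u', "\xD9\xA3"),
    '\s on U+00A0 (no-break space)' => preg_match('/^\s$/u', "\xC2\xA0"),
];

foreach ($withUcp as $label => $matched) {
    printf("%-32s %d\n", $label, $matched);
}

// When only ASCII digits are wanted in 'u' mode, an explicit range avoids
// the Unicode-aware \d.
$asciiOnly = preg_match('/^[0-9]$/u', "\xD9\xA3");
printf("%-32s %d\n", '[0-9] on U+0663', $asciiOnly);
```

The last check matters for input validation: code that relied on \d meaning 0-9 may accept other decimal digits once the 'u' modifier is added.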
Hi guys,
2012/11/6 Philip Olson philip@roshambo.org
Hi geeks,
Does anyone have a suggestion on how the documentation should be
updated? The quote is from here:
http://php.net/manual/en/regexp.reference.character-classes.php
With the quote being:
"In UTF-8 mode, characters with values greater than 128 do
not match any of the POSIX character classes."
A few simple/related facts:
- PCRE_UCP exists as of PCRE 8.10
- Gustavo mentioned the related PHP change on Oct 3, 2010 (not sure
what PHP version, and googling for "87a237342" turns up empty,
and I miss SVN version numbers)
Anyway, how should this be documented?
Regards,
Philip
I added PCRE_UCP in PHP 5.3.4 as a fix for bug #52971. [1]
For the documentation, just say something like:
"In Unicode mode, Unicode properties are used instead to classify
characters of some classes."
More information, extracted from the PCRE documentation [2]:
---------8<-------------------------------
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
Negated versions, such as [:^alpha:] use \P instead of \p.
The other POSIX classes are unchanged, and match only
characters with code points less than 128.
[1] - http://svn.php.net/viewvc/?view=revision&revision=303963
[2] - http://pcre.org/man.txt
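
The mapping quoted above can be checked empirically. The following is a rough sketch, not an exhaustive test: it compares each POSIX class with its listed replacement on a handful of sample characters (my own selection, given as UTF-8 bytes) and collects any case where the two disagree. It assumes a PCRE build with PCRE_UCP in effect for the 'u' modifier.

```php
<?php
// POSIX class => replacement, as quoted from the PCRE documentation.
$equivalents = [
    '[[:alnum:]]' => '\p{Xan}',
    '[[:alpha:]]' => '\p{L}',
    '[[:blank:]]' => '\h',
    '[[:digit:]]' => '\p{Nd}',
    '[[:lower:]]' => '\p{Ll}',
    '[[:space:]]' => '\p{Xps}',
    '[[:upper:]]' => '\p{Lu}',
    '[[:word:]]'  => '\p{Xwd}',
];

// ASCII samples plus U+00E9, U+00C9, U+0663 and U+00A0.
$samples = ['a', 'Z', '7', ' ', '-', "\xC3\xA9", "\xC3\x89", "\xD9\xA3", "\xC2\xA0"];

$disagreements = [];
foreach ($equivalents as $posix => $property) {
    foreach ($samples as $char) {
        $viaPosix    = preg_match('/^' . $posix . '$/u', $char);
        $viaProperty = preg_match('/^' . $property . '$/u', $char);
        if ($viaPosix !== $viaProperty) {
            $disagreements[] = sprintf('%s vs %s on 0x%s', $posix, $property, bin2hex($char));
        }
    }
}

echo $disagreements
    ? implode("\n", $disagreements) . "\n"
    : "all classes agree on the sample characters\n";
```

On a build without PCRE_UCP the list of disagreements should be non-empty, since the POSIX classes would then only match code points below 128.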
--
Regards,
Felipe Pena
hi,
I would try using the bundled PCRE instead. As far as I remember,
almost all distros use the system PCRE, which is not always built with
UTF-8 support.
Cheers,
Pierre
@pierrejoye
On Mon, Nov 5, 2012 at 10:57 AM, Jean-Sébastien Hedde
jeanseb@au-fil-du.net wrote:
Hi,
I'm facing an issue with preg_match and a UTF-8 string.
The pattern is: /^[[:alnum:]\s-'%]+$/u
The string: Régis
If I read the manual correctly, preg_match should return 0 ("In UTF-8 mode,
characters with values greater than 128 do not match any of the POSIX
character classes.") but I get 1 in some cases:
On a Windows host
php 5.2.12 - (PCRE 7.9 2009-04-11) : preg_match === 1
builtin pcre (btw, forget 5.2, go with at least 5.3).
On the same centos host :
php 5.2.10 (Rémi's RPM) - (PCRE 6.6 06-Feb-2006) : preg_match === 0
system's pcre
php 5.4.8 (my build) - (PCRE 8.12 2011-01-15) : preg_match === 1
builtin pcre
On another CentOS host:
php 5.4.0 (Rémi's RPM) - (PCRE 7.8 2008-09-05)
system's pcre
Cheers,
Pierre
@pierrejoye
On Tue, 6 Nov 2012 11:17:34 +0100, Pierre Joye pierre.php@gmail.com
wrote:
I would try using the bundled PCRE instead. As far as I remember,
almost all distro uses the system PCRE and not always build with UTF-8
support.
Hi,
I came to the same conclusion, but I don't see what is missing in the
system PCRE:
pcretest -C
PCRE version 8.02 2010-03-19
Compiled with
UTF-8 support
Unicode properties support
Newline sequence is LF
\R matches all Unicode newlines
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
Regards,
Jean-Sébastien Hedde
au-fil-du.net
On 06/11/2012 11:17, Pierre Joye wrote:
php 5.2.10 (Rémi's RPM) - (PCRE 6.6 06-Feb-2006) : preg_match === 0
I would try using the bundled PCRE instead. As far as I remember,
almost all distro uses the system PCRE and not always build with UTF-8
All my builds have used the bundled PCRE library whenever the system one
is < 8.10, and have done so for months...
The 5.2.10 output looks like a provocation... ;)
Remi
P.S. Remi, not Rémi ;)