Incomprehension with preg_match and utf8

12 years ago by jeanseb@au-fil-du.net

Hi,

I'm facing an issue with preg_match and a UTF-8 string.

The pattern is: /^[[:alnum:]\s-'%]+$/u
The string: Régis

According to the manual, preg_match should return 0 ("In UTF-8 mode,
characters with values greater than 128 do not match any of the POSIX
character classes."), but I get 1 in some cases:

On a Windows host:
php 5.2.12 - (PCRE 7.9 2009-04-11) : preg_match === 1

On the same CentOS host:
php 5.2.10 (Rémi's RPM) - (PCRE 6.6 06-Feb-2006) : preg_match === 0
php 5.4.8 (my build) - (PCRE 8.12 2011-01-15) : preg_match === 1

On another CentOS host:
php 5.4.0 (Rémi's RPM) - (PCRE 7.8 2008-09-05)

How is this possible?
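
For anyone who wants to compare hosts, a minimal reproduction sketch follows. The variable names are illustrative, and the hyphen in the character class is escaped here as an assumption about portability: PCRE2-based PHP versions (7.3 and later) are stricter about a range-like "\s-'" inside a class. The result is printed rather than asserted because, as the list above shows, it depends on the PHP and PCRE build.

```php
<?php
// Same pattern as in the report, with the hyphen escaped so that it is
// always treated as a literal character inside the class.
$pattern = '/^[[:alnum:]\s\-\'%]+$/u';

// "Régis" spelled out as UTF-8 bytes, so the source file encoding
// cannot influence the test.
$subject = "R\xC3\xA9gis";

$result = preg_match($pattern, $subject);

// PCRE_VERSION reports the PCRE library PHP is actually using.
printf("PHP %s, PCRE %s\n", PHP_VERSION, PCRE_VERSION);

// int(1) or int(0) depending on the build; false if the pattern
// could not be compiled at all.
var_dump($result);
```

Running the same script on each host gives the PHP version, the PCRE version, and the match result side by side.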

Regards,

Jean-Sébastien Hedde
au-fil-du.net

12 years ago by Gustavo Lopes

On 2012-11-05 10:57, Jean-Sébastien Hedde wrote:

> I'm facing an issue with preg_match and a UTF-8 string.
>
> The pattern is: /^[[:alnum:]\s-'%]+$/u
> The string: Régis
>
> According to the manual, preg_match should return 0 ("In UTF-8 mode,
> characters with values greater than 128 do not match any of the POSIX
> character classes."), but I get 1 in some cases:

The documentation is simply out of date. We have been setting PCRE_UCP
when the 'u' modifier is present for some time now (since 87a237342,
3 Oct 2010).

Look for PCRE_UCP in http://www.pcre.org/pcre.txt to see what that
implies.

--
Gustavo Lopes

12 years ago by Rasmus Lerdorf

> Hi,
>
> I'm facing an issue with preg_match and a UTF-8 string.
>
> The pattern is: /^[[:alnum:]\s-'%]+$/u
> The string: Régis
>
> According to the manual, preg_match should return 0 ("In UTF-8 mode,
> characters with values greater than 128 do not match any of the POSIX
> character classes."), but I get 1 in some cases:
>
> On a Windows host:
> php 5.2.12 - (PCRE 7.9 2009-04-11) : preg_match === 1
>
> On the same CentOS host:
> php 5.2.10 (Rémi's RPM) - (PCRE 6.6 06-Feb-2006) : preg_match === 0
> php 5.4.8 (my build) - (PCRE 8.12 2011-01-15) : preg_match === 1
>
> On another CentOS host:
> php 5.4.0 (Rémi's RPM) - (PCRE 7.8 2008-09-05)
>
> How is this possible?

I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
actually becomes \p{Xan} which should match Unicode chars as well, but
only if PCRE was compiled with Unicode support. So I suspect you don't
actually have a Unicode-capable PCRE build in some cases there.
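
Whether a given build behaves this way can be probed at runtime. The sketch below is only one possible check; both function names are hypothetical helpers, not part of PHP. It relies on preg_match() returning false (with a warning, suppressed here) when a pattern cannot be compiled.

```php
<?php
// Hypothetical helper: true when \p{...} properties compile and match
// under the "u" modifier, i.e. PCRE has UTF-8 and Unicode property support.
function pcre_has_unicode_properties()
{
    // "é" (U+00E9) as UTF-8 bytes. @ hides the compile warning that a
    // build without Unicode support would raise.
    return @preg_match('/^\p{L}$/u', "\xC3\xA9") === 1;
}

// Hypothetical helper: true when [[:alnum:]] under "u" accepts a
// non-ASCII letter, which is the behavior discussed in this thread.
function posix_classes_use_unicode_properties()
{
    return @preg_match('/^[[:alnum:]]$/u', "\xC3\xA9") === 1;
}

var_dump(pcre_has_unicode_properties());
var_dump(posix_classes_use_unicode_properties());
```

If the first check is true and the second is false, the build has Unicode properties available but does not apply them to POSIX classes.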

-Rasmus

12 years ago by jeanseb@au-fil-du.net

On Mon, 05 Nov 2012 08:04:06 -0800, Rasmus Lerdorf <rasmus@lerdorf.com>
wrote:

> I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
> actually becomes \p{Xan} which should match Unicode chars as well, but
> only if PCRE was compiled with Unicode support. So I suspect you don't
> actually have a Unicode-capable PCRE build in some cases there.
>
> -Rasmus

I will report the bug to the package maintainers (remi, debian too...).

Is there any way for us to avoid those "wrong" builds?

Regards,

Jean-Sébastien Hedde
au-fil-du.net

12 years ago by Rasmus Lerdorf

> On Mon, 05 Nov 2012 08:04:06 -0800, Rasmus Lerdorf <rasmus@lerdorf.com>
> wrote:
>
>> I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
>> actually becomes \p{Xan} which should match Unicode chars as well, but
>> only if PCRE was compiled with Unicode support. So I suspect you don't
>> actually have a Unicode-capable PCRE build in some cases there.
>>
>> -Rasmus
>
> I will report the bug to the package maintainers (remi, debian too...).
>
> Is there any way for us to avoid those "wrong" builds?

I don't see how.

12 years ago by Philip Olson

>> On Mon, 05 Nov 2012 08:04:06 -0800, Rasmus Lerdorf <rasmus@lerdorf.com>
>> wrote:
>>
>>> I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
>>> actually becomes \p{Xan} which should match Unicode chars as well, but
>>> only if PCRE was compiled with Unicode support. So I suspect you don't
>>> actually have a Unicode-capable PCRE build in some cases there.
>>>
>>> -Rasmus
>>
>> I will report the bug to the package maintainers (remi, debian too...).
>>
>> Is there any way for us to avoid those "wrong" builds?
>
> I don't see how.

Hi geeks,

Does anyone have a suggestion on how the documentation should be
updated? The quote is from here:

http://php.net/manual/en/regexp.reference.character-classes.php

With the quote being:

"In UTF-8 mode, characters with values greater than 128 do
not match any of the POSIX character classes."

A few simple/related facts:

- PCRE_UCP exists as of PCRE 8.10
- Gustavo mentioned the related PHP change on Oct 3, 2010 (not sure
  what PHP version, and googling for "87a237342" turns up empty,
  and I miss SVN version numbers)

Anyway, how should this be documented?

Regards,
Philip

12 years ago by Galen Wright-Watson

> [...]
> A few simple/related facts:
>
> [...]
>
> Gustavo mentioned the related PHP change on Oct 3, 2010 (not sure
> what PHP version, and googling for "87a237342" turns up empty,
> and I miss SVN version numbers)

For reference: php_version.h
<http://git.php.net/?p=php-src.git;a=blob;f=main/php_version.h;h=bfa499ac7f1daa2292c75cdc1398950b43d35174;hb=87a237342282fe036bb90486fdd6cdc392e16ac7>
in commit 87a237342282fe036bb90486fdd6cdc392e16ac7
<http://git.php.net/?p=php-src.git;a=commit;h=87a237342282fe036bb90486fdd6cdc392e16ac7>
lists the version as 5.3.99-dev. The commit adds PCRE_UCP
<http://git.php.net/?p=php-src.git;a=commitdiff;h=87a237342282fe036bb90486fdd6cdc392e16ac7;hp=00f75c79ca9318cbd57590b4c01144369612b3c2>
when defined and the "u" modifier is used. The commit message is:

    Fixed bug #52971 (PCRE-Meta-Characters not working with utf-8)

    In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
    characters, even in UTF-8 mode. However, this can be changed by
    setting the PCRE_UCP option.

The PHP changelog <http://php.net/ChangeLog-5.php#5.3.4> lists version
5.3.4 as containing the fix for bug #52971
<https://bugs.php.net/bug.php?id=52971>.
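
The effect described in that commit message can be observed with a small sketch like the one below. It assumes PHP 5.3.4 or later with PCRE 8.10 or later, so that the "u" modifier also sets PCRE_UCP; the array keys and sample strings are illustrative only. The results without "u" are printed but not commented on, since they can depend on the locale tables in use.

```php
<?php
$word  = "R\xC3\xA9gis"; // "Régis" as UTF-8 bytes
$digit = "\xD9\xA3";     // U+0663 ARABIC-INDIC DIGIT THREE as UTF-8 bytes

$checks = array(
    // With "u" (and therefore PCRE_UCP), \w and \d use Unicode properties.
    'w_with_u'    => preg_match('/^\w+$/u', $word),
    'd_with_u'    => preg_match('/^\d+$/u', $digit),
    // Without "u" the subject is treated as a sequence of single bytes.
    'w_without_u' => preg_match('/^\w+$/', $word),
    'd_without_u' => preg_match('/^\d+$/', $digit),
);

var_dump($checks);
```

On a build where PCRE_UCP is in effect, the two "with_u" entries should be 1.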

12 years ago by Felipe Pena

Hi guys,

2012/11/6 Philip Olson <philip@roshambo.org>

>>> On Mon, 05 Nov 2012 08:04:06 -0800, Rasmus Lerdorf <rasmus@lerdorf.com>
>>> wrote:
>>>
>>>> I think the documentation is wrong on that. In Unicode mode [[:alnum:]]
>>>> actually becomes \p{Xan} which should match Unicode chars as well, but
>>>> only if PCRE was compiled with Unicode support. So I suspect you don't
>>>> actually have a Unicode-capable PCRE build in some cases there.
>>>>
>>>> -Rasmus
>>>
>>> I will report the bug to the package maintainers (remi, debian too...).
>>>
>>> Is there any way for us to avoid those "wrong" builds?
>>
>> I don't see how.
>
> Hi geeks,
>
> Does anyone have a suggestion on how the documentation should be
> updated? The quote is from here:
>
> http://php.net/manual/en/regexp.reference.character-classes.php
>
> With the quote being:
>
> "In UTF-8 mode, characters with values greater than 128 do
> not match any of the POSIX character classes."
>
> A few simple/related facts:
>
> - PCRE_UCP exists as of PCRE 8.10
> - Gustavo mentioned the related PHP change on Oct 3, 2010 (not sure
>   what PHP version, and googling for "87a237342" turns up empty,
>   and I miss SVN version numbers)
>
> Anyway, how should this be documented?
>
> Regards,
> Philip

I added PCRE_UCP in PHP 5.3.4 as a fix for bug #52971. [1]

For the documentation, just say something like:

"In Unicode mode, Unicode properties are used instead to classify the
characters of some classes."

More information extracted from PCRE documentation [2]:

---------8<-------------------------------

[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}

Negated versions, such as [:^alpha:] use \P instead of \p.
The other POSIX classes are unchanged, and match only
characters with code points less than 128.

[1] - http://svn.php.net/viewvc/?view=revision&revision=303963
[2] - http://pcre.org/man.txt
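
The mapping quoted above can be spot-checked with a sketch along these lines. It assumes a build where the "u" modifier sets PCRE_UCP (PHP 5.3.4 or later with PCRE 8.10 or later); the sample characters and variable names are illustrative, and only a subset of the classes is covered.

```php
<?php
// POSIX class => the property it is documented to become under PCRE_UCP.
$pairs = array(
    '[[:alnum:]]' => '\p{Xan}',
    '[[:alpha:]]' => '\p{L}',
    '[[:digit:]]' => '\p{Nd}',
    '[[:lower:]]' => '\p{Ll}',
    '[[:upper:]]' => '\p{Lu}',
);

// "é", "É", U+0663 ARABIC-INDIC DIGIT THREE (all as UTF-8 bytes),
// plus three ASCII characters.
$samples = array("\xC3\xA9", "\xC3\x89", "\xD9\xA3", 'a', '7', '%');

$agree = true;
foreach ($pairs as $posix => $property) {
    foreach ($samples as $sample) {
        $viaPosix    = preg_match('/^' . $posix . '$/u', $sample);
        $viaProperty = preg_match('/^' . $property . '$/u', $sample);
        printf("%-12s %-8s %s: %d / %d\n",
            $posix, $property, bin2hex($sample), $viaPosix, $viaProperty);
        if ($viaPosix !== $viaProperty) {
            $agree = false; // the class and the property disagree here
        }
    }
}

var_dump($agree);
```

A result of false points to a build where the POSIX classes are still ASCII-only in UTF-8 mode.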

--
Regards,
Felipe Pena

12 years ago by Pierre Joye

hi,

On Mon, Nov 5, 2012 at 10:57 AM, Jean-Sébastien Hedde
<jeanseb@au-fil-du.net> wrote:

> Hi,
>
> I'm facing an issue with preg_match and a UTF-8 string.
>
> The pattern is: /^[[:alnum:]\s-'%]+$/u
> The string: Régis
>
> According to the manual, preg_match should return 0 ("In UTF-8 mode,
> characters with values greater than 128 do not match any of the POSIX
> character classes."), but I get 1 in some cases:
>
> On a Windows host:
> php 5.2.12 - (PCRE 7.9 2009-04-11) : preg_match === 1
>
> On the same CentOS host:
> php 5.2.10 (Rémi's RPM) - (PCRE 6.6 06-Feb-2006) : preg_match === 0
> php 5.4.8 (my build) - (PCRE 8.12 2011-01-15) : preg_match === 1
>
> On another CentOS host:
> php 5.4.0 (Rémi's RPM) - (PCRE 7.8 2008-09-05)
>
> How is this possible?

I would try using the bundled PCRE instead. As far as I remember,
almost all distros use the system PCRE, which is not always built with
UTF-8 support.

Cheers,

Pierre

@pierrejoye

12 years ago by Pierre Joye

On Mon, Nov 5, 2012 at 10:57 AM, Jean-Sébastien Hedde
<jeanseb@au-fil-du.net> wrote:

> Hi,
>
> I'm facing an issue with preg_match and a UTF-8 string.
>
> The pattern is: /^[[:alnum:]\s-'%]+$/u
> The string: Régis
>
> According to the manual, preg_match should return 0 ("In UTF-8 mode,
> characters with values greater than 128 do not match any of the POSIX
> character classes."), but I get 1 in some cases:
>
> On a Windows host:
> php 5.2.12 - (PCRE 7.9 2009-04-11) : preg_match === 1

Built-in PCRE (by the way, forget 5.2 and go with at least 5.3).

> On the same CentOS host:
> php 5.2.10 (Rémi's RPM) - (PCRE 6.6 06-Feb-2006) : preg_match === 0

System PCRE.

> php 5.4.8 (my build) - (PCRE 8.12 2011-01-15) : preg_match === 1

Built-in PCRE.

> On another CentOS host:
> php 5.4.0 (Rémi's RPM) - (PCRE 7.8 2008-09-05)

System PCRE.

Cheers,

Pierre

@pierrejoye

12 years ago by jeanseb@au-fil-du.net

On Tue, 6 Nov 2012 11:17:34 +0100, Pierre Joye <pierre.php@gmail.com>
wrote:

> I would try using the bundled PCRE instead. As far as I remember,
> almost all distros use the system PCRE, which is not always built with
> UTF-8 support.

Hi,

I came to this conclusion too, but I don't see what is missing in the
system PCRE:

pcretest -C
PCRE version 8.02 2010-03-19
Compiled with
  UTF-8 support
  Unicode properties support
  Newline sequence is LF
  \R matches all Unicode newlines
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack

Regards,

Jean-Sébastien Hedde
au-fil-du.net

12 years ago by Remi Collet

On 06/11/2012 11:17, Pierre Joye wrote:

>> php 5.2.10 (Rémi's RPM) - (PCRE 6.6 06-Feb-2006) : preg_match === 0
>
> I would try using the bundled PCRE instead. As far as I remember,
> almost all distro uses the system PCRE and not always build with UTF-8

For months now, all my builds have used the bundled PCRE library when
the system one is < 8.10...
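
The two thresholds mentioned in this thread (PCRE 8.10 for PCRE_UCP, PHP 5.3.4 for the "u" modifier setting it) can be checked from PHP itself. This is only a sketch with illustrative variable names; it assumes PCRE_VERSION has the form "8.12 2011-01-15", as in the version strings quoted earlier, and it reports the PCRE library PHP is linked against, which may differ from what pcretest on the same host reports.

```php
<?php
// PCRE_VERSION looks like "8.12 2011-01-15"; keep only the version number.
$parts       = explode(' ', PCRE_VERSION);
$pcreVersion = $parts[0];

// PCRE_UCP exists as of PCRE 8.10.
$pcreHasUcp = version_compare($pcreVersion, '8.10', '>=');

// PHP sets PCRE_UCP for the "u" modifier as of 5.3.4.
$phpSetsUcp = version_compare(PHP_VERSION, '5.3.4', '>=');

printf("PCRE %s: PCRE_UCP available: %s\n",
    $pcreVersion, $pcreHasUcp ? 'yes' : 'no');
printf("PHP %s: sets PCRE_UCP for /u: %s\n",
    PHP_VERSION, $phpSetsUcp ? 'yes' : 'no');
```

If either line prints "no", POSIX classes under "u" can be expected to stay ASCII-only on that build.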

The output for 5.2.10 looks like a provocation... ;)

Remi

P.S. Remi, not Rémi ;)