Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:79490
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.217.174 as permitted sender)
MIME-Version: 1.0
Sender: xmilky@gmail.com
In-Reply-To: <10EE9A5B-1711-455A-AB6A-6E7EA858D081@ajf.me>
References: <E21EA75D-69AA-408C-88B7-C8022E9FB2C7@ajf.me>
	<CADG7izVM1qVFQC9wp4An=sTn-VB-PSkSmUgNUiOFRsm9uGDF6A@mail.gmail.com>
	<10EE9A5B-1711-455A-AB6A-6E7EA858D081@ajf.me>
Date: Tue, 9 Dec 2014 08:07:29 +0100
Message-ID: <CADG7izUy_j5NjvR4o1aK8o9uTFg80=FKDiR78pMEQXNssYMHug@mail.gmail.com>
To: Andrea Faulds <ajf@ajf.me>
Cc: internals <internals@lists.php.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [PHP-DEV] [VOTE][RFC] Unicode Codepoint Escape Syntax
From: mario@include-once.org

Tue, 9 Dec 2014 02:44:33 +0000 Andrea Faulds <ajf@ajf.me>:
>
> Well, PCRE does what it does probably because of its name:
> *Perl-Compatible* Regular Expressions. Perl has the \x syntax. But
> PCRE’s syntax comes from what suits Perl, not PHP, so I don’t see why
> we should necessarily match its behaviour. If we add \x{xxxxx} syntax
> to PHP’s string literals, then we’ll break existing code which uses
> double quoted strings for regular expressions.

Actually the opposite seems alarming. For double-quoted strings it'd be
irrelevant whether \u{} or \x{} was handled by PHP first or left to PCRE's
interpretation. (Having an alternative there is even beneficial.)
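
To illustrate the double-quoted case, a minimal sketch of current
behaviour as far as I understand it: PHP only treats \x as an escape when
hex digits follow directly, so "\x{41}" reaches PCRE untouched and PCRE
resolves it. The variable names are purely illustrative.

```php
<?php
// "\x{" is not a PHP escape today (PHP wants hex digits right after \x),
// so the backslash sequence survives and is left to PCRE.
$pattern = "~^\x{41}$~u";

// Six literal characters: \ x { 4 1 }
$literalLength = strlen("\x{41}");

// PCRE interprets \x{41} as U+0041, i.e. "A".
$matched = preg_match($pattern, 'A');
```

If PHP started interpreting \x{41} itself, PCRE would simply receive a
literal "A" instead, and the match result would stay the same.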

However, single-quoted strings are more commonly and habitually used for
regexps. Once \u{} is used regularly, it will unknowingly or accidentally
end up in a regex context, which is where it would trigger PCRE failures.

    preg_match('~\u{bad}~umixUs', $subject)
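
A rough sketch of that failure mode, under the assumption that PHP's
bundled PCRE is built without any \u support (which is what I observe).
The subject string and variable names are made up for the example.

```php
<?php
// U+0BAD encoded as UTF-8 by hand, so the sketch does not depend on any
// string-literal escape under discussion.
$subject = "\xE0\xAE\xAD";

// PCRE has no \u escape; pattern compilation fails and preg_match()
// returns false (the warning is silenced here for brevity).
$withU = @preg_match('~\u{bad}~u', $subject);

// The same code point written in PCRE's own \x{} syntax matches.
$withX = preg_match('~\x{bad}~u', $subject);
```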

Both \u{} and \x{} are used in fringe cases only of course. Consistently
settling on one would still benefit forward compatibility here.

>
> I think \x{xxxx} is misleading anyway - \xXX is always
> single-byte/character, yet Unicode code points can’t be represented
> in PHP strings as single bytes when encoded in UTF-8 (unless they’re
> below U+0100, of course). If I saw "\x{abcd}" I'd expect it to be the
> same as "\xab\xbc". Plus, while Perl has \x{xxxx} syntax, Ruby and
> ECMAScript 6 have the \u{xxxx} syntax, so \u{xxxx} is already more
> popular. The ‘u’ in \u{xxxx} also makes it more obviously “Unicode”.

There's no question really about \u being more common and therefore
recognizable and preferable. Taking the cue from Ruby is appreciated!

Since the RFC rightly discounts the standard \uFFFF form for compatibility
reasons, there is, however, little visual or semantic distinction between
the {}-embellished variant \u{hhhhh} and a hypothetical \x{hhhhh}.

Not sure why or who would misinterpret \x{abc} as multiple bytes, really.
It's well understood and works for PCRE. The advantage of overloading
\x is a much lower likelihood of ever encountering a residual "\x{" in
PHP strings.
Whereas "\u" is new and never had an implicit payload constraint, and thus
could run into a preexisting "\u{xxxx}" that was formerly targeted at a
later/distinct context.
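
A sketch of that residual-string concern. The version check assumes the
escape is interpreted from PHP 7.0 onward, and the JavaScript snippet is
a hypothetical example of a "later context".

```php
<?php
// Hypothetical legacy string: meant to reach a JavaScript engine verbatim.
$snippet = "var s = '\u{41}';";

// Assumption: \u{...} is interpreted from PHP 7.0 on; before that the
// backslash sequence is kept as six literal characters.
$interpreted = PHP_VERSION_ID >= 70000;
$expected = $interpreted ? "var s = 'A';" : 'var s = \'\u{41}\';';
```

The same source line thus hands different bytes to the later context
depending on the PHP version, without any warning.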

Going with the Ruby theme: when piping a string there or receiving one,
it's irrelevant who uses which syntax to preinterpret it. It only
really matters when exchanging string literals.
But the RFC and the patch don't cover stripcslashes() or addcslashes()
for instance.
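
For illustration, a small sketch of how stripcslashes() treats the two
forms today; this only shows existing behaviour, not anything from the
patch.

```php
<?php
// stripcslashes() understands the C-style \xHH escape...
$hex = stripcslashes('\x41');

// ...but knows nothing about \u{...}, so it does not yield "A" here.
$unicode = stripcslashes('\u{41}');
```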

So no direct string-syntax interoperability is planned, which is why I
brought forward \x{hhhhh} as an alternative, for within-PHP consistency
at least.
(Not bent on lobbying for x, as \u{…} is visually more pleasing; just
unsure about its scope.)

\u{1F44B}