Newsgroups: php.internals
Subject: Re: [PHP-DEV] [VOTE][RFC] Unicode Codepoint Escape Syntax
From: mario@include-once.org
To: Andrea Faulds
Cc: internals
Date: Tue, 9 Dec 2014 08:07:29 +0100
In-Reply-To: <10EE9A5B-1711-455A-AB6A-6E7EA858D081@ajf.me>

Tue, 9 Dec 2014 02:44:33 +0000 Andrea Faulds:
>
> Well, PCRE does what it does probably because of its name:
> *Perl-Compatible* Regular Expressions. Perl has the \x syntax. But
> PCRE’s syntax comes from what suits Perl, not PHP, so I don’t see why
> we should necessarily match its behaviour. If we add \x{xxxxx} syntax
> to PHP’s string literals, then we’ll break existing code which uses
> double quoted strings for regular expressions.

Actually the opposite seems alarming. For double-quoted strings it'd be
irrelevant whether \u{} or \x{} was handled by PHP first or left to
PCRE's interpretation. (Having an alternative there is even beneficial.)

However, single-quoted strings are more commonly and habitually used
for regexps. Once \u{} is in regular use, it will end up in regex
context unknowingly or accidentally, and that is where it would trigger
PCRE failures:

    preg_match('~\u{bad}~umixUs')

Both \u{} and \x{} are used in fringe cases only, of course. Consistently
settling on one would still benefit forward compatibility here.

>
> I think \x{xxxx} is misleading anyway - \xXX is always
> single-byte/character, yet Unicode code points can’t be represented
> in PHP strings as single bytes when encoded in UTF-8 (unless they’re
> below U+0100, of course). If I saw "\x{abcd}” I'd expect it to be the
> same as "\xab\xbc”. Plus, while Perl has \x{xxxx} syntax, Ruby and
> ECMAScript 6 have the \u{xxxx} syntax, so \u{xxxx} is already more
> popular. The ‘u’ in \u{xxxx} also makes it more obviously “Unicode”.

There's no question really about \u being more common and therefore more
recognizable and preferable. Taking the cue from Ruby is appreciated!

Since the RFC rightly discounts the standard \uFFFF due to compatibility
reasons, there's however little visual and semantic distinction between
the {}-embellished variant \u{hhhhh} and a hypothetical \x{hhhhh}.
Not sure why or who would misinterpret \x{abc} as multiple bytes, really.
It's well understood and working for PCRE.

The advantage of overloading \x is a much lower likelihood of ever
encountering a residual "\x{" in PHP strings. Whereas "\u" is new and
never had an implicit payload constraint, so it could run into a
preexisting "\u{xxxx}" that was formerly targeted at a later/distinct
context.

Going with the Ruby theme; when piping a string there or receiving one,
it's irrelevant who uses which syntax to preinterpret it. It's only
really interesting when exchanging string literals. But the RFC and
the patch don't cover stripcslashes() or addcslashes(), for instance.
So no direct string syntax interoperability is earmarked.
Which is why I brought forward \x{hhhhh} as an alternative, for
within-PHP consistency at least.

(Not bent on lobbying for x, as \u{…} is visually more pleasing; just
unsure about its scope.)

\u{1F44B}
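
To make the single-quoted vs. double-quoted concern concrete, here is a
minimal sketch. It assumes PHP 7.0 or later, where the RFC's "\u{...}"
escape is interpreted inside double-quoted strings, plus the bundled PCRE
extension. The variable names are illustrative only and come from neither
the RFC nor the patch.

```php
<?php
// Sketch under assumptions: PHP >= 7.0 and ext/pcre. Names are hypothetical.

// Double-quoted: PHP expands the escape to UTF-8 bytes itself,
// so PCRE never sees a backslash sequence here.
$expanded = "\u{1F44B}";
$byteLength = strlen($expanded);   // 4, the UTF-8 encoding of U+1F44B

// Single-quoted: the backslash sequence reaches PCRE untouched.
// PCRE understands \x{...}; the /u modifier is needed for code points
// above 0xFF.
$viaPcre = preg_match('~\x{1F44B}~u', $expanded);   // 1 (match)

// PCRE has no \u escape, so this pattern fails to compile.
// The @ operator only silences the warning so the return value
// can be inspected.
$rejected = @preg_match('~\u{1F44B}~u', $expanded); // false

var_dump($byteLength, $viaPcre, $rejected);
```

The third call is the case described above: a \u{} sequence written out of
habit in a single-quoted pattern is handed to PCRE unchanged and rejected
there, whereas \x{} is accepted in both places.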