Newsgroups: php.internals
From: rquadling@gmail.com (Richard Quadling)
Reply-To: RQuadling@googlemail.com
Date: Tue, 15 Mar 2011 12:55:38 +0000
Subject: Re: [PHP-DEV] preg_replace does not replace all occurrences
To: Ben Schmidt
Cc: internals@lists.php.net
In-Reply-To: <4D7F5E96.8040507@yahoo.com.au>
References: <4D7F5E96.8040507@yahoo.com.au>

On 15 March 2011 12:41, Ben Schmidt wrote:
>>>>>    static $re = '/(^|[^\\\\])\'/';
>>>
>>> Did no one see why the regex was wrong?
>
> I saw what the regex was. I didn't think like you that it was 'wrong'.
>
> Once you unescape the characters in the PHP single-quoted string above
> (where two backslashes count as one, and backslash-quote counts as a
> quote), the actual pattern that reaches the preg_replace function is:
>
>   /(^|[^\\])'/
>
>>> RegexBuddy (a windows app) explains regexes VERY VERY well.
>
> What kind of patterns? Does it support PCRE ones?
>

Yep and MANY other flavours (C#, C++, Delphi, Groovy, Java, JavaScript,
MySQL, ...)

>> The important bit (where the problem lies with regard to the regex) is
>> ...
>>
>> Match a single character NOT present in the list below «[^\\\\]»
>>         A \ character «\\»
>>         A \ character «\\»
>
> This is not the case.
>
> 1. As above, the pattern reaching preg_replace is /(^|[^\\])'/
>
> 2. PCRE, unlike many other regular expression implementations, allows
> backslash-escaping inside character classes (square brackets). So the
> doubled backslash only actually counts as a single backslash character
> to be excluded from the set of characters the atom will match.
>
> There is no error here. (And even if there were two backslashes being
> excluded, of course, it wouldn't hurt anything or change the meaning of
> the pattern.)
>
>> The issue is the word _single_.
>
> I don't think anybody thought otherwise.
>
> The problem was that, to a casual observer, the pattern seems to mean "a
> quote which doesn't already have a backslash before it". I believe this
> was its intent. (And the replacement added the 'missing' backslash.)
>
> But the pattern doesn't mean that. It actually means "a character which
> isn't a backslash, followed by a quote". This is subtly different.
>
> And it's most noticeable when two quotes follow each other in the
> subject string. In
>
>   str''str
>
> first the pattern matches "r'" (non-backslash followed by quote), and
> then it keeps searching from that point, i.e. it searches "'str". Since
> this isn't the beginning of the string, and there is no quote following
> a non-backslash character, there are no further matches.
>
> Now, here is a pattern which actually means "a quote which doesn't
> already have a backslash before it" which is achieved by means of a
> lookbehind assertion, which, even when searching the string after the
> first match, "'str", still 'looks back' on the earlier part of the
> string to recognise the second quote is not preceded by a backslash and
> match a second time:
>
>   /(^|(?
>
> As a PHP single-quoted string this is:
>
>   '/(^|(?
>
> Hope this helps,
>
> Ben.
>
>
>

If I say ... I get ... /(^|[^\\])'/ which is explained as ...

(^|[^\\])'

Options: case insensitive; ^ and $ match at line breaks

Match the regular expression below and capture its match into backreference number 1 «(^|[^\\])»
   Match either the regular expression below (attempting the next alternative only if this one fails) «^»
      Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^\\]»
      Match any character that is NOT a \ character «[^\\]»
Match the character “'” literally «'»

And that certainly makes a LOT more sense.

Decoding regexes and handling the escaping needed for the language is a
real headache sometimes.

Just imagine creating regex code for use by client side JavaScript using PHP.

8 \ in a row for a single \ wouldn't be impossible.

Sorry for the confusion.

-- 
Richard Quadling
Twitter : EE : Zend
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY
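
For anyone following along, here is a minimal sketch of the behaviour Ben describes, assuming a stock PHP with the PCRE extension. The first pattern is the one from the thread. The second is only an assumed lookbehind variant written for illustration (it is not necessarily the exact pattern Ben posted), and the variable names are hypothetical.

```php
<?php
// Pattern from the thread. In a single-quoted PHP string, \\\\ becomes two
// backslashes and \' becomes a quote, so PCRE receives: /(^|[^\\])'/
$consuming = '/(^|[^\\\\])\'/';

// Assumed alternative for illustration: a negative lookbehind, which checks
// the preceding character without consuming it. PCRE receives: /(?<!\\)'/
$lookbehind = '/(?<!\\\\)\'/';

// Replacement text: two backslashes at runtime, which preg_replace treats as
// one literal backslash, followed by a quote.
$backslashQuote = '\\\\' . "'";

$subject = "str''str";

// The consuming pattern matches "r'" and then resumes at the second quote,
// which is neither at the start of the string nor followed by another quote.
$a = preg_replace($consuming, '$1' . $backslashQuote, $subject);

// The lookbehind pattern matches each quote on its own.
$b = preg_replace($lookbehind, $backslashQuote, $subject);

echo $consuming, "\n"; // /(^|[^\\])'/
echo $a, "\n";         // str\''str   (second quote left unescaped)
echo $b, "\n";         // str\'\'str  (both quotes escaped)
```

Running the lookbehind version a second time over its own output should leave it unchanged, since every quote is then already preceded by a backslash.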