Newsgroups: php.internals
To: Thomas Landauer, internals@lists.php.net
Message-ID: <2c54d906-35c7-6ed6-ac47-649b7bcde2a2@gmx.de>
Date: Fri, 2 Oct 2020 14:24:36 +0200
Subject: Re: Suggestion: Make all PCRE functions return *character* offsets, rather than *byte* offsets if the modifier `u` (PCRE_UTF8) is given
From: cmbecker69@gmx.de ("Christoph M. Becker")

On 10/2/2020 at 12:01 PM, Thomas Landauer wrote:

> this is a follow-up of a bug I opened, and cmb suggested to continue
> here: https://bugs.php.net/bug.php?id=80166

Indeed, thanks!

> Advantages:
>
> 1: Easier string manipulation:
> If somebody does (as in my case) `preg_match_all()` with
> PREG_OFFSET_CAPTURE, what will they probably use those returned
> numbers/offsets for?
> My answer: For *splitting the string* - in some way or the other. Now,
> with byte offsets, I can't do such basic things as just `+1` to get to
> the next character. Or extract exactly 3 characters.

The term "character" is ambiguous wrt. Unicode. The mbstring functions
work on Unicode code points, so it's probably better to use that term
instead. While it is trivial to get the next code point using index+1,
this is not necessarily the next character, as perceived by a human.
Using mb_substr(), you may even break "characters", e.g. .

> 2: Better performance:
> This may sound odd, since cmb said the exact opposite ;-) (sequential
> access vs. random access). However, if I need character offsets (see 1),
> what can I do? I'm forced to use some workaround on top - as e.g.
> https://www.php.net/manual/en/function.preg-match-all.php#71572 - which
> is certainly way slower than any native implementation.

If mbstring functions are used to find some offset, they always have to
traverse the string from the beginning, even if you are just interested
in the last code point of a long string. If you have byte offsets, that
code point can be accessed directly.

Of course, that may not suit every possible scenario, but I still don't
think that the PCRE functions should deal with code point offsets
instead of byte offsets.

Regards,
Christoph
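
To make the byte offset vs. code point offset distinction discussed
above concrete, here is a minimal sketch. The sample strings and
variable names are illustrative assumptions, not taken from the thread
or the bug report, and it assumes ext/mbstring is loaded. It shows that
PREG_OFFSET_CAPTURE reports a byte offset even with the `u` modifier,
how that offset can be converted to a code point offset when one is
really needed, and how mb_substr() can split a combining sequence.

```php
<?php
// Illustrative sketch only; strings and names are assumptions.

// "naïve café", written with escapes so the file encoding does not matter.
$subject = "na\u{EF}ve caf\u{E9}";

// With PREG_OFFSET_CAPTURE, preg_match() reports byte offsets, even with /u.
preg_match('/caf./u', $subject, $m, PREG_OFFSET_CAPTURE);
[$match, $byteOffset] = $m[0];            // $byteOffset is 7

// A byte offset plugs straight into substr()/strlen().
$viaBytes = substr($subject, $byteOffset, strlen($match));

// Converting to a code point offset means decoding the whole prefix.
$codePointOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8'); // 6
$viaCodePoints = mb_substr(
    $subject,
    $codePointOffset,
    mb_strlen($match, 'UTF-8'),
    'UTF-8'
);

// Code points are not user-perceived characters: "e" followed by U+0301
// (combining acute accent) renders as one character but is two code points.
$decomposed = "e\u{301}";
$firstCodePoint = mb_substr($decomposed, 0, 1, 'UTF-8'); // "e", accent cut off

var_dump($byteOffset, $codePointOffset, $viaBytes === $viaCodePoints, $firstCodePoint);
```

Both routes extract the same substring here; the difference is that the
mbstring route has to walk the string from the start to locate the
code point position, whereas substr() can jump to the byte offset.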