Newsgroups: php.internals
To: internals@lists.php.net
From: thomas@landauer.at (Thomas Landauer)
Date: Fri, 2 Oct 2020 12:01:41 +0200
Subject: Suggestion: Make all PCRE functions return *character* offsets, rather than *byte* offsets if the modifier `u` (PCRE_UTF8) is given

Hi,

this is a follow-up to a bug I opened; cmb suggested continuing the discussion here:
https://bugs.php.net/bug.php?id=80166

Advantages:

1: Easier string manipulation:
If somebody calls (as in my case) `preg_match_all()` with PREG_OFFSET_CAPTURE, what will they probably use the returned offsets for? My answer: For *splitting the string* in one way or another. With byte offsets, I can't do such basic things as just `+1` to get to the next character, or extract exactly 3 characters.

2: Better performance:
This may sound odd, since cmb said the exact opposite ;-) (sequential access vs. random access). However, if I need character offsets (see 1), what can I do? I'm forced to add some workaround on top, e.g. https://www.php.net/manual/en/function.preg-match-all.php#71572, which is certainly way slower than any native implementation.

3: Consistency with users' expectations:
The current behavior causes confusion and is perceived as counter-intuitive, see https://www.php.net/manual/en/function.preg-match-all.php#61426 and https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php

So I'm suggesting:
* Either do the BC break, and just return character offsets if the modifier `u` is given.
* Or create *new* functions for it: `mb_preg_match_all()` etc.

--
Cheers,
Thomas
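
P.S. To illustrate advantages 1 and 2, here is a minimal sketch of the kind of userland workaround that is needed today. The helper name `byte_to_char_offset()` is hypothetical (it is not part of PHP and not necessarily what the linked manual comment does), and the sketch assumes the mbstring extension is loaded and the subject is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): convert a byte offset, as returned
// by preg_match_all() with PREG_OFFSET_CAPTURE, into a character offset.
function byte_to_char_offset(string $subject, int $byteOffset): int
{
    // Count the UTF-8 characters in the bytes that precede the match.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// Built from explicit UTF-8 byte escapes so the example does not depend on
// the encoding of this source file: "äöü abc".
$subject = "\u{00E4}\u{00F6}\u{00FC} abc";

preg_match_all('/abc/u', $subject, $matches, PREG_OFFSET_CAPTURE);

// The offset is counted in bytes, even though the `u` modifier is given.
$byteOffset = $matches[0][0][1];
$charOffset = byte_to_char_offset($subject, $byteOffset);

// Character-based functions such as mb_substr() need the character offset.
$extracted = mb_substr($subject, $charOffset, 3, 'UTF-8');
echo $extracted, "\n";
```

Each of the three umlauts takes two bytes in UTF-8, so the byte offset and the character offset of `abc` differ, and the extra pass over the prefix is the overhead a native character-offset mode would avoid.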