Newsgroups: php.internals
From: claude.pache@gmail.com (Claude Pache)
Date: Fri, 2 Oct 2020 22:11:46 +0200
To: Thomas Landauer
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] Suggestion: Make all PCRE functions return *character* offsets, rather than *byte* offsets if the modifier `u` (PCRE_UTF8) is given

Hi,

Working with UTF-8-encoded strings does not imply working with mbstring functions or with code-point counts.
Personally, I work with standard string functions, plus [Grapheme functions](https://www.php.net/manual/en/ref.intl.grapheme.php) when I need to split my string between “characters” (which for me means “grapheme clusters”, not “code points”, so mbstring functions are useless to me). In particular, PREG_OFFSET_CAPTURE always does what I need, even when using the /u flag.

If this is a feature that you want to implement, I suggest adding a flag PREG_UTF8_CODEPOINT_OFFSET_CAPTURE.

-Claude
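
To illustrate the point about byte offsets, here is a minimal sketch rather than anything definitive. The sample string and variable names are made up for the example; `mb_strlen` assumes the mbstring extension and `grapheme_strlen` assumes the intl extension, so both calls are guarded. It shows that the byte offset returned with PREG_OFFSET_CAPTURE under /u plugs straight into the standard string functions, and that a code-point or grapheme offset can be derived from it when one is really needed.

```php
<?php
// "naïve café" with a decomposed ï (i + U+0308), so that bytes,
// code points and grapheme clusters all give different counts.
$subject = "nai\u{308}ve caf\u{E9}";

// With /u the pattern is matched as UTF-8, but the captured offset is in bytes.
preg_match('/caf./u', $subject, $m, PREG_OFFSET_CAPTURE);
[$text, $byteOffset] = $m[0];

// The byte offset works directly with standard string functions.
$prefix = substr($subject, 0, $byteOffset);
$again  = substr($subject, $byteOffset, strlen($text));

// Other offsets can be derived from the prefix when they are needed.
$codePointOffset = function_exists('mb_strlen')
    ? mb_strlen($prefix, 'UTF-8')
    : null;
$graphemeOffset = function_exists('grapheme_strlen')
    ? grapheme_strlen($prefix)
    : null;

var_dump($byteOffset, $codePointOffset, $graphemeOffset, $again === $text);
```

With both extensions loaded, the three offsets for this string should be 8 (bytes), 7 (code points) and 6 (grapheme clusters).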