Newsgroups: php.internals
Xref: news.php.net php.internals:106048
References:
 <8442f1fa5544b2ca03e7cebbc64e8e5c@wkhudgins.info>
 <683c5da474e13283030cac3d0c0ec080@wkhudgins.info>
 <2c37999d1e5372ae6ab48bfce5420796@wkhudgins.info>
 <2CF672F8-12F5-4D37-8B8C-591A6E695220@benramsey.com>
 <3E2100B1-7BF7-4C9F-AA77-D82924A2D5FC@gmail.com>
 <8CFCFE96-E2B7-456B-85A3-8737754C59D6@benramsey.com>
Date: Mon, 24 Jun 2019 11:28:18 +0200
To: Ben Ramsey
Cc: Rowan Collins, PHP internals
Subject: Re: [PHP-DEV] [RFC] Desire to move RFC add_str_begin_and_end_functions to a vote
From: nikita.ppv@gmail.com (Nikita Popov)

On Sun, Jun 23, 2019 at 5:46 PM Nikita Popov wrote:

> On Sun, Jun 23, 2019 at 5:30 PM Ben Ramsey wrote:
>
>> > On Jun 23, 2019, at 05:35, Rowan Collins wrote:
>> >
>> > On 22 June 2019 20:56:24 BST, Ben Ramsey wrote:
>> >> Perhaps it would only be an issue with the case-insensitive versions,
>> >> as Nikita points out? If so, can someone provide some example strings
>> >> where an mb_starts_with_ci() would return true, while
>> >> str_starts_with_ci() would return false?
>> >
>> >
>> > That's easy: any character that has a lower- and uppercase form, and
>> > is not represented as one byte in the target encoding. For that
>> > matter, any such character in the non-ASCII section of a single-byte
>> > encoding, since a non-mbstring case insensitive flag would presumably
>> > leave everything other than ASCII letters untouched.
>> >
>> > So, any non-Latin script, like Greek or Cyrillic; any accented
>> > characters, unless you're lucky and they're represented by
>> > ASCII-letter plus combining modifier; the Turkish "i", which if I
>> > remember rightly has three forms not two; and so on.
>>
>>
>> According to Google, “İyi akşamlar” is the Turkish phrase for “Good
>> evening” (Turkish speakers, please correct me, if this is wrong). However,
>> using the existing mb_* functions, I can’t get mb_stripos() to return 0
>> when trying to see if the string “İYI AKŞAMLAR” begins with “i̇yi.”
>>
>> I’m just using UTF-8, so maybe there’s an encoding issue here?
>>
>> $string = 'İyi akşamlar';
>> $upper = mb_strtoupper($string);
>> $lowerChars = mb_strtolower(mb_substr($string, 0, 3));
>>
>> var_dump($string, $upper, $lowerChars);
>> var_dump(mb_stripos($upper, $lowerChars));
>>
>
> The reason why this doesn't work is that mb_stripos internally performs a
> simple case fold, while a full case fold would be needed in this case
> (Turkish i is hard). It's a bit tricky due to the need to remap character
> offsets.
>

I've implemented use of full case folding in
https://github.com/php/php-src/pull/4303. While doing that I more or less
convinced myself that we probably shouldn't actually do this, because it
breaks simple mb_stripos loops in a subtle way. It probably makes more
sense for people to explicitly call mb_convert_case($string, MB_CASE_FOLD)
and then operate on the resulting strings. That is both much more efficient
and avoids the offset remapping issues.

Nikita
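
A minimal sketch of the explicit-folding approach suggested above, assuming PHP 7.3 or later (where MB_CASE_FOLD is available) and UTF-8 input. The helper name starts_with_folded() is hypothetical and not part of mbstring or the RFC. Both strings are folded once and then compared as plain bytes, so no offsets ever have to be mapped back to the original string:

```php
<?php
// Hypothetical helper (not part of mbstring or the RFC): a case-insensitive
// prefix test that folds both strings up front instead of relying on
// mb_stripos(). Assumes PHP >= 7.3 for MB_CASE_FOLD and UTF-8 by default.
function starts_with_folded(string $haystack, string $needle, string $encoding = 'UTF-8'): bool
{
    // Full case folding may change the string length (e.g. "ß" folds to "ss"),
    // which is why both sides are folded before any comparison happens.
    $haystack = mb_convert_case($haystack, MB_CASE_FOLD, $encoding);
    $needle   = mb_convert_case($needle, MB_CASE_FOLD, $encoding);

    // Both strings are now in folded form, so a byte-wise prefix comparison
    // is sufficient and nothing refers back to offsets in the originals.
    return strncmp($haystack, $needle, strlen($needle)) === 0;
}

// The Turkish example from the quoted message, plus a length-changing fold.
var_dump(starts_with_folded('İYI AKŞAMLAR', mb_strtolower('İyi')));
var_dump(starts_with_folded('STRASSE', 'straß'));
```

The result is only meaningful as a yes/no answer. Any position computed on the folded strings refers to the folded text, not to the original input, which is the offset remapping problem mentioned above.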