Newsgroups: php.internals
From: nikita.ppv@gmail.com (Nikita Popov)
Date: Sun, 23 Jun 2019 17:46:23 +0200
Subject: Re: [PHP-DEV] [RFC] Desire to move RFC add_str_begin_and_end_functions to a vote
To: Ben Ramsey
Cc: Rowan Collins, PHP internals

On Sun, Jun 23, 2019 at 5:30 PM Ben Ramsey wrote:

> > On Jun 23, 2019, at 05:35, Rowan Collins wrote:
> >
> > On 22 June 2019 20:56:24 BST, Ben Ramsey wrote:
> >> Perhaps it would only be an issue with the case-insensitive versions,
> >> as Nikita points out? If so, can someone provide some example strings
> >> where an mb_starts_with_ci() would return true, while
> >> str_starts_with_ci() would return false?
> >
> >
> > That's easy: any character that has a lower- and uppercase form, and is
> > not represented as one byte in the target encoding. For that matter, any
> > such character in the non-ASCII section of a single-byte encoding, since a
> > non-mbstring case insensitive flag would presumably leave everything other
> > than ASCII letters untouched.
> >
> > So, any non-Latin script, like Greek or Cyrillic; any accented
> > characters, unless you're lucky and they're represented by ASCII-letter
> > plus combining modifier; the Turkish "i", which if I remember rightly has
> > three forms not two; and so on.
>
>
> According to Google, “İyi akşamlar” is the Turkish phrase for “Good
> evening” (Turkish speakers, please correct me, if this wrong).
> However, using the existing mb_* functions, I can’t get mb_stripos() to
> return 0 when trying to see if the string “İYI AKŞAMLAR” begins with
> “i̇yi.”
>
> I’m just using UTF-8, so maybe there’s an encoding issue here?
>
> $string = 'İyi akşamlar';
> $upper = mb_strtoupper($string);
> $lowerChars = mb_strtolower(mb_substr($string, 0, 3));
>
> var_dump($string, $upper, $lowerChars);
> var_dump(mb_stripos($upper, $lowerChars));
>

The reason this doesn't work is that mb_stripos() internally performs a
simple case fold, while a full case fold would be needed in this case
(Turkish i is hard). Doing a full case fold there is a bit tricky: a full
fold can change the number of characters in a string, so match offsets
would have to be remapped back to the original string.

Nikita
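To make the simple versus full case fold distinction concrete, here is a rough sketch rather than a proposed implementation. It assumes PHP 7.3 or later with mbstring, where mb_convert_case() accepts the MB_CASE_FOLD and MB_CASE_FOLD_SIMPLE modes, and it assumes UTF-8 input. The helper name starts_with_ci_full() is made up for this example and is not part of PHP or of the RFC. Because a prefix test only returns a boolean, it can compare the folded strings directly and never has to map an offset back to the original string.

```php
<?php
// Hypothetical helper (not part of PHP): case-insensitive prefix test
// that applies a full case fold to both strings. Assumes UTF-8 input.
function starts_with_ci_full(string $haystack, string $needle): bool
{
    $foldedHaystack = mb_convert_case($haystack, MB_CASE_FOLD, 'UTF-8');
    $foldedNeedle   = mb_convert_case($needle, MB_CASE_FOLD, 'UTF-8');

    // In UTF-8 a byte-wise prefix match is also a code point prefix match.
    return strncmp($foldedHaystack, $foldedNeedle, strlen($foldedNeedle)) === 0;
}

$upper  = "\u{0130}YI AK\u{015E}AMLAR"; // "İYI AKŞAMLAR"
$needle = "i\u{0307}yi";                // "i" + combining dot above + "yi"

// Dump the UTF-8 bytes of both folds of U+0130 to compare them.
var_dump(bin2hex(mb_convert_case("\u{0130}", MB_CASE_FOLD_SIMPLE, 'UTF-8')));
var_dump(bin2hex(mb_convert_case("\u{0130}", MB_CASE_FOLD, 'UTF-8')));

// Built-in search (simple fold) versus the full-fold prefix test.
var_dump(mb_stripos($upper, $needle, 0, 'UTF-8'));
var_dump(starts_with_ci_full($upper, $needle));
```

On such a build I would expect mb_stripos() to report no match, as Ben observed, and the full-fold helper to report a match, since the full fold of U+0130 is "i" followed by U+0307. The sketch does not normalize its input, so precomposed and decomposed forms of other characters can still fail to match.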