Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:75233
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.216.51 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <CAHMUw2GZx0nv2SCcchy_3T=f15+WpRczm0Wp3jqq5wWaAf4X7A@mail.gmail.com>
References: <CAHMUw2EohSiEszkxER4CsR9L0FWz=P9audQ5o6WxLpn4KQjXfA@mail.gmail.com>
 <679D0316-74C5-4AEC-9097-5E9793937469@ajf.me> <53B1590F.5070009@gmail.com>
 <CAKOpQSw16jVjbFE471A4DBZg5Sr2ONsge_y1zRHPNeO1pORtOg@mail.gmail.com>
 <CAHMUw2GpdLpQsQigEp+WTxp7Vq7HLS6wzZ7o6Pt1e0fYp72z5w@mail.gmail.com>
 <CAHMUw2FzGucbgaXCF9D35qDZJB5ovu-4YOMGWwpVhCHxiddn8A@mail.gmail.com>
 <CAPg3Xx+DyZzc2Orku6kQoFSfHwRjB0CoGL4z+76iTdt1LZoLww@mail.gmail.com> <CAHMUw2GZx0nv2SCcchy_3T=f15+WpRczm0Wp3jqq5wWaAf4X7A@mail.gmail.com>
Date: Thu, 3 Jul 2014 14:45:30 +0100
Message-ID: <CAPg3XxJfErfmb8ypuxFcxF2G6RZB-ADm1BCPH3S_javb_F1cKQ@mail.gmail.com>
To: Tjerk Meesters <tjerk.meesters@gmail.com>
Cc: Kris Craig <kris.craig@gmail.com>, Rowan Collins <rowan.collins@gmail.com>, 
	PHP internals list <internals@lists.php.net>
Content-Type: multipart/alternative; boundary=001a11390c9c42bb2604fd4a3b91
Subject: Re: [PHP-DEV] Re: ucwords() vs title case
From: petercowburn@gmail.com (Peter Cowburn)

--001a11390c9c42bb2604fd4a3b91
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 3 July 2014 14:15, Tjerk Meesters <tjerk.meesters@gmail.com> wrote:

> Hi!
>
>
> On Thu, Jul 3, 2014 at 8:56 PM, Peter Cowburn <petercowburn@gmail.com>
> wrote:
>
>>
>>
>>
>> On 3 July 2014 13:39, Tjerk Meesters <tjerk.meesters@gmail.com> wrote:
>>
>>> On Wed, Jul 2, 2014 at 1:19 AM, Tjerk Meesters <tjerk.meesters@gmail.com>
>>> wrote:
>>>
>>> > Hi Kris,
>>> >
>>> >
>>> > On Tue, Jul 1, 2014 at 7:25 AM, Kris Craig <kris.craig@gmail.com>
>>> wrote:
>>> >
>>> >> On Mon, Jun 30, 2014 at 5:33 AM, Rowan Collins <
>>> rowan.collins@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > Andrea Faulds wrote (on 30/06/2014):
>>> >> >
>>> >> >> On 30 Jun 2014, at 12:54, Tjerk Meesters <tjerk.meesters@gmail.com>
>>> >> >> wrote:
>>> >> >>
>>> >> >>  Hi internals,
>>> >> >>>
>>> >> >>> I came across this old bug: https://bugs.php.net/bug.php?id=34407
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> Personally I find that the latter is too much of a departure from
>>> >> >>> what we currently have; a compromise could be to treat punctuation
>>> >> >>> as a word delimiter.
>>> >> >>>
>>> >> >> Hmm. Why not make it follow what \b in a regex would do, looking for
>>> >> >> “word boundaries”?
>>> >> >>
>>> >> >
>>> >> > Unfortunately, the cleverer you try to be, the more edge cases you
>>> >> > find. For instance, using \b will capitalise the 's' after an
>>> >> > apostrophe, e.g. in "Andrea'S Suggestion".
>>> >> >
>>> >> > The function we have in our code base at the moment looks like this:
>>> >> >
>>> >> > function smart_uc_words($string)
>>> >> > {
>>> >> >         $string = strtolower(trim($string));
>>> >> >         // Capitalise any word char preceded by a non-word char other
>>> >> >         // than an apostrophe
>>> >> >         $string = preg_replace_callback('/(?<!\w|\')(\w)/',
>>> >> >             function($m){ return strtoupper($m[1]); }, $string);
>>> >> >         // Capitalise any word char which comes between an apostrophe
>>> >> >         // and another word char
>>> >> >         $string = preg_replace_callback('/(?<=\')(\w)(?=\w)/',
>>> >> >             function($m){ return strtoupper($m[1]); }, $string);
>>> >> >
>>> >> >         return $string;
>>> >> > }
>>> >> >
>>> >>
>>> >> What about leaving the default behavior as-is but adding an optional
>>> >> argument to specify how to determine these boundaries?  So if you did
>>> >> something like ucwords( "hello, world!", '\b' ) or ucwords( "hello,
>>> >> world!", array( ' ', '.', ... ) ), the user could control the behavior
>>> >> while existing ucwords( $arg ) code would behave as it does now without
>>> >> any BC break.
>>> >>
>>> >
>>> > Yeah, that seems like an option, so basically how `trim()` works too;
>>> > treat these characters as word boundaries (default is " \t\r\n").
>>> >
>>> >     ucwords("hello (new) world", " ()");
>>> >
>>> > I'll prepare a PR for this and see how far that takes us :) let me know
>>> > if you guys have any other ideas.
>>> >
>>>
>>> I've created a PR here: https://github.com/php/php-src/pull/706
>>
>>
>> Your previous mail mentioned, "so basically how `trim()` works too", but
>> the PR doesn't quite do that.
>>
>
> That's somewhat embarrassing; I didn't realise that character ranges are
> supported in trim() =S
>
> Despite this oversight, I personally don't see a practical need to
> support a character range, because the given characters are not likely to
> be letters, but rather hyphens, braces, punctuation marks, spaces, etc. I
> was also hoping to keep the function rather simple :)
>

The charmasks aren't limited to letters only.  You could go crazy and use
"\0../:..@[..`{..\x7F" for everything non-alphanumeric in ASCII, if you
really wanted to.
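
To make that concrete, here is a minimal sketch of trim() with range-style
charmasks. The variable names and sample strings are made up for
illustration; the only thing relied on is the documented ".." range syntax
in trim()'s second argument.

```php
<?php
// "a..z" is read as a range: every byte from 'a' through 'z'.
$lettersStripped = trim("xxHELLOxx", "a..z");

// A mask intended to cover all non-alphanumeric ASCII bytes:
//   \0../   0x00-0x2F (control chars, space, punctuation before '0')
//   :..@    0x3A-0x40 (the gap between '9' and 'A')
//   [..`    0x5B-0x60 (the gap between 'Z' and 'a')
//   {..\x7F 0x7B-0x7F (everything after 'z')
$nonAlnumMask = "\0../:..@[..`{..\x7F";
$nonAlnumStripped = trim("--[Hello, World 42]!!", $nonAlnumMask);

var_dump($lettersStripped);   // string(5) "HELLO"
var_dump($nonAlnumStripped);  // string(15) "Hello, World 42"
```

Note that trim() only strips from both ends, so the comma and spaces inside
the second string are left alone.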

That said, I have no particular preference one way or the other (for
ucwords()) and was mostly just clarifying the point about working how
trim() works, or not.
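
For comparison, here is a rough userland sketch of a delimiter list that is
taken literally, with no ".." expansion. To be clear, this is not the code
from the PR; ucwords_delims() is a hypothetical helper written only to show
how a literal list differs from a trim()-style charmask.

```php
<?php
// Hypothetical helper (not the PR's implementation): uppercases the first
// byte of the string and any byte that directly follows one of the
// delimiter characters. Every character in $delimiters is taken literally,
// so ".." is NOT expanded into a range the way trim() would expand it.
function ucwords_delims($string, $delimiters = " \t\r\n")
{
    $result = $string;
    $length = strlen($string);
    $atWordStart = true; // the very first byte always starts a word

    for ($i = 0; $i < $length; $i++) {
        if ($atWordStart) {
            $result[$i] = strtoupper($string[$i]);
        }
        // The next byte starts a word only if this byte is a delimiter.
        $atWordStart = strpos($delimiters, $string[$i]) !== false;
    }

    return $result;
}

$plain   = ucwords_delims("hello (new) world");        // "Hello (new) World"
$braces  = ucwords_delims("hello (new) world", " ()"); // "Hello (New) World"
// With literal handling, "a..z" just means the characters 'a', '.' and 'z'.
$literal = ucwords_delims("Foo bar", "a..z");          // "Foo baR"
```

With range support, the last call would instead treat every lowercase
letter as a delimiter, which is the difference I was pointing out.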


>
>
>>
>> Should ucwords() also accept character ranges, just like trim()?  i.e.,
>> ucwords("Foo bar", "a..z");  [not a very practical example, I know]
>>
>>
>>>
>>>
>>> If there are no objections I would like to commit this into 5.4 onwards
>>> somewhere next week.
>>>
>>> Thanks.
>>>
>>>
>>> >
>>> >
>>> >
>>> >> --Kris
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > --
>>> > Tjerk
>>> >
>>>
>>>
>>>
>>> --
>>> --
>>> Tjerk
>>>
>>
>>
>
>
> --
> --
> Tjerk
>

--001a11390c9c42bb2604fd4a3b91--