Proposal for a new basic function: str_contains

5 years ago by Peter Bowyer — view source

On Fri, 14 Feb 2020 at 09:18, Philipp Tanlak philipp.tanlak@gmail.com
wrote:

I would like to propose the new basic function: str_contains.

The proposed signature for this function follows the conventions of other
signatures of string functions and should look like this:
str_contains(string $haystack, string $needle): bool
What are your opinions on this proposal?

In principle, yes. There are a couple of considerations first, like how you
plan to handle case-insensitive matches; and previous discussions for this
and the wider context of related string functions:
https://externals.io/message/106162
https://externals.io/message/100142
https://externals.io/message/94787
https://wiki.php.net/rfc/add_str_begin_and_end_functions

Peter

5 years ago by Aegir Leet — view source

I generally like the idea, but it seems many (most?) real-world
implementations actually use mb_strpos() !== false by default.

https://github.com/danielstjules/Stringy/blob/df24ab62d2d8213bbbe88cc36fc35a4503b4bd7e/src/Stringy.php#L206-L215
https://github.com/illuminate/support/blob/6eff6cff19f7ad5540b9a61a9fb3612ca8218c19/Str.php#L157-L166

So there should definitely be an mb_str_contains in ext/mbstring in
addition to the regular str_contains proposed here.

5 years ago by G. P. B. — view source

I generally like the idea, but it seems many (most?) real-world
implementations actually use mb_strpos() !== false by default.

https://github.com/danielstjules/Stringy/blob/df24ab62d2d8213bbbe88cc36fc35a4503b4bd7e/src/Stringy.php#L206-L215

https://github.com/illuminate/support/blob/6eff6cff19f7ad5540b9a61a9fb3612ca8218c19/Str.php#L157-L166

So there should definitely be an mb_str_contains in ext/mbstring in
addition to the regular str_contains proposed here.

The biggest reason to have an mb_* variant is case-insensitive comparison.
The only other reason is if you need to check a string which is in a
different encoding, which, I'm assuming, is a quasi non-existent problem
as everything is UTF-8 nowadays.

The reason why I personally voted no on the previous RFC was that I don't
see the value of having functions checking if a string starts/ends with a
sequence but not a general one. Moreover, checking for a substring to
start/end a string seems to be fitting for the current strpos functions.

This function on its own is way more reasonable and useful to add IMHO.

Best regards

George P. Banyard

5 years ago by Guilliam Xavier — view source

Moreover, checking for a substring to start/end a string seems
to be
fitting for the current strpos functions.

Maybe in terms of semantics (0 === strpos($haystack, $needle)), but
suboptimal in terms of performance, especially when $haystack is a
very long string which doesn't contain $needle, strpos() will
vainly search along the whole string, while a specialized function
would stop as soon as possible (which is also the case of existing
strncmp() but you need to write 0 === strncmp($haystack, $needle, strlen($needle)), arguably not really the cleanest code...).

For "contains" you have to search along the whole string anyway, so
str_contains() is "just" false !== strpos() but cleaner.
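
The two points above can be sketched in userland as follows. This is only an illustration: the function names starts_with_sketch and contains_sketch are hypothetical and are not part of the proposal or of any library.

```php
<?php
// Hypothetical helper: prefix check via strncmp().
// Only the first strlen($needle) bytes are compared, so a long haystack
// that does not start with $needle is rejected without a full scan.
function starts_with_sketch(string $haystack, string $needle): bool
{
    return strncmp($haystack, $needle, strlen($needle)) === 0;
}

// Hypothetical helper: substring check via strpos().
// The empty needle is handled explicitly so the sketch behaves the same
// on PHP 7 (where strpos() rejects an empty needle) and PHP 8.
function contains_sketch(string $haystack, string $needle): bool
{
    return $needle === '' || strpos($haystack, $needle) !== false;
}

var_dump(starts_with_sketch('haystack', 'hay'));   // bool(true)
var_dump(starts_with_sketch('haystack', 'stack')); // bool(false)
var_dump(contains_sketch('haystack', 'st'));       // bool(true)
```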

To be clear, I'm not against the current proposal (rather for
actually) [I just would want str_{starts,ends}_with even more
(without case-insensitive nor multibyte variants)]

--
Guilliam Xavier


5 years ago by Philipp Tanlak — view source

Now that we've talked about the pros and cons of case-insensitivity and
multibyte variants, I'm still unsure what your opinions on those are.

Should we include a case-insensitive variant (str_icontains) ?
Should we include multibyte variants (mb_str_icontains) ?

Slightly off-topic:
Also, since this is my first time I'm trying to contribute: How can we
proceed to write an RFC?
I've read in the howto, that I need to earn RFC karma in order to create a
new RFC page. How can I request that?
My wiki.php.net username is: philippta and my email is
philipp.tanlak@gmail.com

Thanks for your help :)

5 years ago by Nikita Popov — view source

On Mon, Feb 17, 2020 at 10:03 AM Philipp Tanlak philipp.tanlak@gmail.com
wrote:

Now that we've talked about the pros and cons of case-insensitivity and
multibyte variants, I'm still unsure what your opinions on those are.

Should we include a case-insensitive variant (str_icontains) ?

Should we include multibyte variants (mb_str_icontains) ?

Especially considering how past proposals in this general area went, I'd
suggest to start small (just str_contains), and then go from there.
(Personally I'd like to have the trifecta of str_contains, str_starts_with
and str_ends_with in one go, but given that a proposal for the latter two
recently failed... though the main contention there seems to be the
case-insensitive part, not the functions themselves.)

Slightly off-topic:

Also, since this is my first time I'm trying to contribute: How can we
proceed to write an RFC?
I've read in the howto, that I need to earn RFC karma in order to create a
new RFC page. How can I request that?
My wiki.php.net username is: philippta and my email is
philipp.tanlak@gmail.com

I've granted you RFC karma on the wiki, so you should be able to add a new
page under wiki.php.net/rfcs now.

Regards,
Nikita

5 years ago by Philipp Tanlak — view source

Am Mo., 17. Feb. 2020 um 10:53 Uhr schrieb Nikita Popov <
nikita.ppv@gmail.com>:

On Mon, Feb 17, 2020 at 10:03 AM Philipp Tanlak philipp.tanlak@gmail.com
wrote:

Now that we've talked about the pros and cons of case-insensitivity and
multibyte variants, I'm still unsure what your opinions on those are.

Should we include a case-insensitive variant (str_icontains) ?

Should we include multibyte variants (mb_str_icontains) ?

Especially considering how past proposals in this general area went, I'd
suggest to start small (just str_contains), and then go from there.
(Personally I'd like to have the trifecta of str_contains, str_starts_with
and str_ends_with in one go, but given that a proposal for the latter two
recently failed... though the main contention there seems to be the
case-insensitive part, not the functions themselves.)

Slightly off-topic:

Also, since this is my first time I'm trying to contribute: How can we
proceed to write an RFC?
I've read in the howto, that I need to earn RFC karma in order to create a
new RFC page. How can I request that?
My wiki.php.net username is: philippta and my email is
philipp.tanlak@gmail.com

I've granted you RFC karma on the wiki, so you should be able to add a new
page under wiki.php.net/rfcs now.

Regards,
Nikita

Thanks for the karma! An RFC has been created:
https://wiki.php.net/rfc/str_contains

Kind Regards,
Philipp

5 years ago by Benjamin Morel — view source

Thanks for the karma! An RFC has been created:
https://wiki.php.net/rfc/str_contains

Something that's missing from the RFC is the behaviour when $needle is an
empty string:

str_contains('abc', '');
str_contains('', '');

Will these always return false?

— Benjamin

5 years ago by Nikita Popov — view source

On Mon, Feb 17, 2020 at 12:49 PM Benjamin Morel benjamin.morel@gmail.com
wrote:

Thanks for the karma! An RFC has been created:

https://wiki.php.net/rfc/str_contains

Something that's missing from the RFC is the behaviour when $needle is an
empty string:

str_contains('abc', '');
str_contains('', '');

Will these always return false?

As of PHP 8, behavior of '' in string search functions is well defined, and
we consider '' to occur at every position in the string, including one past
the end. As such, both of these will (or at least should) return true. The
empty string is contained in every string.
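
A short check of the empty-needle behaviour described here, assuming PHP 8.0 or later (str_contains() does not exist in earlier versions, and strpos() rejected an empty needle there):

```php
<?php
// Requires PHP 8.0+, where an empty needle is well defined
// for the string search functions.
$inNonEmpty = str_contains('abc', ''); // true
$inEmpty    = str_contains('', '');    // true
$firstPos   = strpos('abc', '');       // 0: the empty string matches at the start

var_dump($inNonEmpty, $inEmpty, $firstPos);
```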

Regards,
Nikita

5 years ago by Philipp Tanlak — view source

Am Mo., 17. Feb. 2020 um 12:56 Uhr schrieb Nikita Popov <
nikita.ppv@gmail.com>:

On Mon, Feb 17, 2020 at 12:49 PM Benjamin Morel benjamin.morel@gmail.com
wrote:

Thanks for the karma! An RFC has been created:

https://wiki.php.net/rfc/str_contains

Something that's missing from the RFC is the behaviour when $needle is an
empty string:

str_contains('abc', '');
str_contains('', '');

Will these always return false?

As of PHP 8, behavior of '' in string search functions is well defined,
and we consider '' to occur at every position in the string, including one
past the end. As such, both of these will (or at least should) return true.
The empty string is contained in every string.

Regards,
Nikita

Thanks for the hint Benjamin. I've cited Nikita and added that to the RFC
for clarification.

5 years ago by Claude Pache — view source

Le 14 févr. 2020 à 10:17, Philipp Tanlak philipp.tanlak@gmail.com a écrit :

Hello PHP Devs,

I would like to propose the new basic function: str_contains.

The goal of this proposal is to standardize on a function, to check whether
or not a string is contained in another string, which has a very common
use-case in almost every PHP project.
PHP Frameworks like Laravel create helper functions for this behavior
because it is so ubiquitous.

Some time ago, an RFC proposing to add str_starts_with() and str_ends_with() was unfortunately declined:

https://wiki.php.net/rfc/add_str_begin_and_end_functions

Therefore, unless several people have changed their minds in the right direction over the last few months, I am pessimistic about the acceptance of str_contains().

—Claude

5 years ago by Nikita Popov — view source

On Fri, Feb 14, 2020 at 10:18 AM Philipp Tanlak philipp.tanlak@gmail.com
wrote:

Hello PHP Devs,

I would like to propose the new basic function: str_contains.

The goal of this proposal is to standardize on a function, to check whether
or not a string is contained in another string, which has a very common
use-case in almost every PHP project.
PHP Frameworks like Laravel create helper functions for this behavior
because it is so ubiquitous.

There are currently a couple of approaches to create such a behavior, most
commonly:
<?php
strpos($haystack, $needle) !== false;
strstr($haystack, $needle) !== false;
preg_match('/' . $needle . '/', $haystack) != 0;

All of these functions serve the same purpose but are either not intuitive,
easy to get wrong (especially with the !== comparison) or hard to remember
for new PHP developers.
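
To illustrate why these idioms are easy to get wrong, here is a small sketch with made-up sample values. It shows the loose comparison failing when the match is at offset 0, and the regex variant misbehaving when the needle is not escaped:

```php
<?php
$haystack = 'abc';

// A match at offset 0 makes strpos() return int(0), which is loosely equal to false.
$loose  = strpos($haystack, 'a') != false;  // false: the match is missed
$strict = strpos($haystack, 'a') !== false; // true: the correct check

// Without escaping, "." in the needle is a regex wildcard,
// so "a.c" is reported as contained in "abc".
$unescaped = preg_match('/' . 'a.c' . '/', $haystack) === 1;                  // true (false positive)
$escaped   = preg_match('/' . preg_quote('a.c', '/') . '/', $haystack) === 1; // false

var_dump($loose, $strict, $unescaped, $escaped);
```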

The proposed signature for this function follows the conventions of other
signatures of string functions and should look like this:
str_contains(string $haystack, string $needle): bool
This function is very easy to implement, has no side effects or backward
compatibility issues.
I've implemented this feature and created a pull request on GitHub ( Link:
https://github.com/php/php-src/pull/5179 ).

To get this function into the PHP core, I will open up an RFC for this.
But first, I would like to get your opinions and consensus on this
proposal.

What are your opinions on this proposal?

Sounds good to me. This operation is needed often enough that it deserves a
dedicated function.

I'd recommend leaving the proposal at only str_contains(), in particular:

Do not propose a case-insensitive variant. I believe this is really the
point on which the last str_starts_with/str_ends_with proposal failed.
Do not propose mb_str_contains(). Especially as no offsets are involved,
there is no reason to have this function. (For UTF-8, the behavior would be
exactly equivalent to str_contains.)

Regards,
Nikita

5 years ago by Philipp Tanlak — view source

Am Fr., 14. Feb. 2020 um 12:54 Uhr schrieb Nikita Popov <
nikita.ppv@gmail.com>:

On Fri, Feb 14, 2020 at 10:18 AM Philipp Tanlak philipp.tanlak@gmail.com
wrote:
Hello PHP Devs,

I would like to propose the new basic function: str_contains.

The goal of this proposal is to standardize on a function, to check
whether
or not a string is contained in another string, which has a very common
use-case in almost every PHP project.
PHP Frameworks like Laravel create helper functions for this behavior
because it is so ubiquitous.

There are currently a couple of approaches to create such a behavior, most
commonly:
<?php
strpos($haystack, $needle) !== false;
strstr($haystack, $needle) !== false;
preg_match('/' . $needle . '/', $haystack) != 0;

All of these functions serve the same purpose but are either not
intuitive,
easy to get wrong (especially with the !== comparison) or hard to remember
for new PHP developers.

The proposed signature for this function follows the conventions of other
signatures of string functions and should look like this:
str_contains(string $haystack, string $needle): bool
This function is very easy to implement, has no side effects or backward
compatibility issues.
I've implemented this feature and created a pull request on GitHub ( Link:
https://github.com/php/php-src/pull/5179 ).

To get this function into the PHP core, I will open up an RFC for this.
But first, I would like to get your opinions and consensus on this
proposal.

What are your opinions on this proposal?
Sounds good to me. This operation is needed often enough that it deserves
a dedicated function.

I'd recommend leaving the proposal at only str_contains(), in particular:

Do not propose a case-insensitive variant. I believe this is really the
point on which the last str_starts_with/str_ends_with proposal failed.

Do not propose mb_str_contains(). Especially as no offsets are
involved, there is no reason to have this function. (For UTF-8, the
behavior would be exactly equivalent to str_contains.)

Regards,
Nikita

I'd like to elaborate on Nikita's response: I don't think an mb_str_contains
is necessary, because the proposed function does not behave differently if
the input strings are multibyte strings.
When searching for a multibyte string in another multibyte string, the
return value would consistently be true/false. The position/offset at which
the multibyte string was found is not relevant.
The reason for the existence of strpos/mb_strpos is the fact that the
returned position/offset varies depending on whether or not the string is a
multibyte string.
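
A small sketch of that difference, assuming UTF-8 input and that ext/mbstring is loaded. The offsets differ between the byte-based and the character-based function, while the yes/no answer is the same:

```php
<?php
// "ä" (U+00E4) takes two bytes in UTF-8, so "b" sits at byte offset 2
// but at character offset 1.
$haystack = "\u{00E4}b";

$bytePos = strpos($haystack, 'b');                // 2
$charPos = mb_strpos($haystack, 'b', 0, 'UTF-8'); // 1

// For a pure "contains" question both agree.
$byteContains = $bytePos !== false; // true
$charContains = $charPos !== false; // true

var_dump($bytePos, $charPos, $byteContains, $charContains);
```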

The only possible valid variants concerning multibyte and case-insensitivity
I see are:

str_contains: works as expected with multibyte and non-multibyte strings.
mb_str_icontains: is the only valid option to do a case-insensitive search
for multibyte strings.

Unneeded variants I see are:

mb_str_contains: does not behave differently when compared to
str_contains, as mentioned above.
str_icontains: is a possible option but could be error-prone when
used with multibyte strings like UTF-8, which is the de facto standard
nowadays.

I'm certain there would be confusion among php developers when the newly
proposed functions are only str_contains and mb_str_icontains.

Patrick ALLAERT:
Yes, it does have one: people having already defined a str_contains()
function in the global scope will have a PHP Fatal error: Cannot redeclare
str_contains()

You are absolutely correct with this. Although functions added by
frameworks to the global scope are usually guarded by: if
(!function_exists('str_contains')) {}
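
A guarded userland definition of that kind might look like the sketch below. It is an assumption about how a framework could write it, not code taken from any particular project. The empty-needle check mirrors the behaviour discussed elsewhere in this thread.

```php
<?php
// Only define the function if PHP (or another library) has not already done so,
// which avoids "Cannot redeclare str_contains()".
if (!function_exists('str_contains')) {
    function str_contains(string $haystack, string $needle): bool
    {
        // The empty string is treated as contained in every string.
        return $needle === '' || strpos($haystack, $needle) !== false;
    }
}

var_dump(str_contains('abc', 'b')); // bool(true)
var_dump(str_contains('abc', 'd')); // bool(false)
```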

5 years ago by Andrea Faulds — view source

Hi,

Philipp Tanlak wrote:

I'd like to elaborate on Nikita's response: I don't think an mb_str_contains is
necessary, because the proposed function does not behave differently, if
the input strings are multibyte strings.

This is not true for all character encodings. For UTF-8 it is correct,
but consider for example the Japanese encoding Shift_JIS, where the
second byte of a multi-byte character can be a valid first byte of a
single-byte character. str_contains() would have incorrect behaviour for
this case.
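
A concrete instance of this, as a sketch that assumes ext/mbstring is loaded: in Shift_JIS the katakana "ソ" is encoded as the two bytes 0x83 0x5C, and 0x5C on its own is a single-byte character (backslash or yen sign, depending on the font).

```php
<?php
// Shift_JIS bytes for the single character "ソ".
$haystack = "\x83\x5C";
// The single-byte character 0x5C.
$needle = "\x5C";

// The byte-wise search finds 0x5C inside the two-byte character: a false positive.
$byteWise = strpos($haystack, $needle) !== false;

// The encoding-aware search compares characters and finds no match.
$encodingAware = mb_strpos($haystack, $needle, 0, 'SJIS') !== false;

var_dump($byteWise);      // bool(true)
var_dump($encodingAware); // bool(false)
```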

Regards,
Andrea Faulds

5 years ago by Pierre Joye — view source

hello,

On Fri, Feb 14, 2020 at 10:18 AM Philipp Tanlak philipp.tanlak@gmail.com
wrote:
Hello PHP Devs,

I would like to propose the new basic function: str_contains.

The goal of this proposal is to standardize on a function, to check
whether
or not a string is contained in another string, which has a very common
use-case in almost every PHP project.
PHP Frameworks like Laravel create helper functions for this behavior
because it is so ubiquitous.

There are currently a couple of approaches to create such a behavior,
most
commonly:
<?php
strpos($haystack, $needle) !== false;
strstr($haystack, $needle) !== false;
preg_match('/' . $needle . '/', $haystack) != 0;

All of these functions serve the same purpose but are either not
intuitive,
easy to get wrong (especially with the !== comparison) or hard to
remember
for new PHP developers.

The proposed signature for this function follows the conventions of other
signatures of string functions and should look like this:
str_contains(string $haystack, string $needle): bool
This function is very easy to implement, has no side effects or backward
compatibility issues.
I've implemented this feature and created a pull request on GitHub (
Link:
https://github.com/php/php-src/pull/5179 ).

To get this function into the PHP core, I will open up an RFC for this.
But first, I would like to get your opinions and consensus on this
proposal.

What are your opinions on this proposal?
Sounds good to me. This operation is needed often enough that it deserves a
dedicated function.

I'd recommend leaving the proposal at only str_contains(), in particular:

Do not propose a case-insensitive variant. I believe this is really the
point on which the last str_starts_with/str_ends_with proposal failed.

Do not propose mb_str_contains(). Especially as no offsets are involved,
there is no reason to have this function. (For UTF-8, the behavior would be
exactly equivalent to str_contains.)

Btw, while some mbstring references were mentioned, I do like the ICU search
implementation as well.

http://userguide.icu-project.org/collation/icu-string-search-service

It handles a lot of cases based on locales.

Regards,
Nikita

5 years ago by Rowan Tommins — view source

Btw, while some mbstring references I I mentioned, I do like the ICU search
implementation as well.

http://userguide.icu-project.org/collation/icu-string-search-service

It handles a lot of cases based on locales.

That's a lovely example of why treating Unicode as a character encoding is
the wrong mindset.

I would love to see more people using ext/intl rather than ext/mbstring,
and more ICU features like this being included.

Regards,

Rowan Tommins
[IMSoP]

5 years ago by Andreas Heigl — view source

Hey all.

Just a short note why I voted against the current implementation of the
str_contains functionality.

While it is mainly aimed at being a mere convenience-function that could
also be easily implemented in userland it misses one main thing IMO when
handling unicode-strings: Normalization.

It is correct, that the binary representation of the string "äöüß"
within the string "Täöüßtstring" seems to be the same and that a simple
strpos('Täöüßtstring', 'äöüß') results in a not-false result.

But with Unicode it might be that the two strings use different
normalizations. So to the human eye the two strings look (almost)
identical but internally they are completely different (and even
mb_strpos might not be able to detect the similarity).

See https://3v4l.org/fasO4 for more information.
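
The effect can be reproduced with a short sketch. It assumes UTF-8 input and that ext/intl is loaded for the Normalizer class. The helper name contains_normalized is hypothetical, and this is not the code behind the link above.

```php
<?php
// The same visible text in two normalization forms.
$composed   = "T\u{00E4}st"; // "ä" as one code point (NFC)
$decomposed = "a\u{0308}";   // "a" followed by a combining diaeresis (NFD)

// Byte-wise search does not see the decomposed "ä" inside the composed string.
$rawMatch = strpos($composed, $decomposed) !== false; // false

// Hypothetical helper: normalize both sides to NFC before searching.
function contains_normalized(string $haystack, string $needle): bool
{
    $h = Normalizer::normalize($haystack, Normalizer::FORM_C);
    $n = Normalizer::normalize($needle, Normalizer::FORM_C);
    if ($h === false || $n === false) {
        return false; // normalization failed, for example on invalid UTF-8
    }
    return $n === '' || strpos($h, $n) !== false;
}

$normalizedMatch = contains_normalized($composed, $decomposed); // true

var_dump($rawMatch, $normalizedMatch);
```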

As we are creating new functionality here it would have been great to
solve this issue. But as it is IMO merely a convenience add on that can
easily be implemented in userland I vote against it.

Cheers

Andreas

--
,,,
(o o)
+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl |
| mailto:andreas@heigl.org N 50°22'59.5" E 08°23'58" |
| http://andreas.heigl.org http://hei.gl/wiFKy7 |
+---------------------------------------------------------------------+
| http://hei.gl/root-ca |
+---------------------------------------------------------------------+

5 years ago by Rowan Tommins — view source

While it is mainly aimed at being a mere convenience-function that could
also be easily implemented in userland it misses one main thing IMO when
handling unicode-strings: Normalization.

While I would love to see more functionality for handling Unicode which
didn't treat it as just another character set, I don't think sprinkling it
into the main string functions of the language would be the right approach.
Even if we changed all the existing functions to be "Unicode-aware", as was
planned for PHP 6, the resulting API would not handle all cases correctly.

In this case, a Unicode-based string API ought to provide at least two
variants of "contains", as options or separate functions:

a version which matches on code point, for answering queries like "does
this string contain right-to-left override characters?"
at least one form of normalization, but probably several

If there was serious work on a new string API in progress, a freeze on
additions to the current API would make sense; but right now, the
byte-based string API is what we have, and I think this function is a
sensible addition to it.

Regards,

Rowan Tommins
[IMSoP]

5 years ago by Nicolas Grekas — view source

Le mar. 3 mars 2020 à 11:04, Rowan Tommins rowan.collins@gmail.com a
écrit :

While it is mainly aimed at being a mere convenience-function that could
also be easily implemented in userland it misses one main thing IMO when
handling unicode-strings: Normalization.

While I would love to see more functionality for handling Unicode which
didn't treat it as just another character set, I don't think sprinkling it
into the main string functions of the language would be the right approach.
Even if we changed all the existing functions to be "Unicode-aware", as was
planned for PHP 6, the resulting API would not handle all cases correctly.

In this case, a Unicode-based string API ought to provide at least two
variants of "contains", as options or separate functions:

a version which matches on code point, for answering queries like "does
this string contain right-to-left override characters?"

at least one form of normalization, but probably several

If there was serious work on a new string API in progress, a freeze on
additions to the current API would make sense; but right now, the
byte-based string API is what we have, and I think this function is a
sensible addition to it.

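FYI, I wrote a String handling lib, shipped as Symfony String:

doc: https://symfony.com/doc/current/components/string.html

src: https://github.com/symfony/string

TL;DR, it provides 3 classes of value objects, dealing with bytes, code
points and grapheme clusters (~= normalized unicode)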

It makes no sense to have str_contains() or any global function able to
deal with Unicode normalization unless the PHP string values embed their
unit system (one of: bytes, codepoints or graphemes).
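
The three unit systems can be seen with plain PHP functions. This sketch assumes UTF-8 input with ext/mbstring and ext/intl loaded, and does not use Symfony String itself:

```php
<?php
// "a" followed by a combining diaeresis: one visible character.
$s = "a\u{0308}";

$bytes      = strlen($s);             // 3: "a" is 1 byte, U+0308 is 2 bytes in UTF-8
$codePoints = mb_strlen($s, 'UTF-8'); // 2: U+0061 and U+0308
$graphemes  = grapheme_strlen($s);    // 1: a single grapheme cluster

var_dump($bytes, $codePoints, $graphemes);
```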

With this rationale, I agree with Rowan: PHP's native string functions deal
with bytes. So should str_contains(). Other unit systems can be implemented
in userland (until PHP implements something similar to Symfony String in
core - but that's another topic.)

Nicolas

5 years ago by Andreas Heigl — view source

Am 03.03.20 um 14:29 schrieb Nicolas Grekas:

Le mar. 3 mars 2020 à 11:04, Rowan Tommins rowan.collins@gmail.com a
écrit :

While it is mainly aimed at being a mere convenience-function that could
also be easily implemented in userland it misses one main thing IMO when
handling unicode-strings: Normalization.

While I would love to see more functionality for handling Unicode which
didn't treat it as just another character set, I don't think sprinkling it
into the main string functions of the language would be the right approach.
Even if we changed all the existing functions to be "Unicode-aware", as was
planned for PHP 6, the resulting API would not handle all cases correctly.

In this case, a Unicode-based string API ought to provide at least two
variants of "contains", as options or separate functions:

a version which matches on code point, for answering queries like "does
this string contain right-to-left override characters?"

at least one form of normalization, but probably several

If there was serious work on a new string API in progress, a freeze on
additions to the current API would make sense; but right now, the
byte-based string API is what we have, and I think this function is a
sensible addition to it.

FYI, I wrote a String handling lib, shipped as Symfony String:

doc: https://symfony.com/doc/current/components/string.html

src: https://github.com/symfony/string

TL;DR, it provides 3 classes of value objects, dealing with bytes, code
points and grapheme cluster (~= normalized unicode)

It makes no sense to have str_contains() or any global function able to
deal with Unicode normalization unless the PHP string values embed their
unit system (one of: bytes, codepoints or graphemes).

With this rationale, I agree with Rowan: PHP's native string functions deal
with bytes. So should str_contains(). Other unit systems can be implemented
in userland (until PHP implements something similar to Symfony String in
core - but that's another topic.)

str_contains as it is currently implemented can also easily be
implemented in userland. That was my reasoning. I would think otherwise
if it took Unicode into account, as that's much harder to implement in
userland.

And I didn't want to start a new discussion; I merely wanted to explain
the reasoning behind my decision.

Cheers

Andreas
