Negative string offsets

14 years ago by Derick Rethans

Negative string offsets have long been on my wish list, and I have been
running an implementation of them in my own PHP build for some time. They
work the same way as negative offsets in substr(), but avoid the function
call and are much neater when a single character has to be extracted:

$str = "Hallo";

$str[0] == "H"
$str[-1] == "o";

If -6 is used as the offset, the existing warning is displayed, because for
this five-character string it is the first undefined negative offset.

The same thing for setting:

$str[-1] = '0';
$str[-4] = "4";

will result in "H4ll0"
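On a build without the patch, the proposed behaviour can be approximated in userland. This is only a sketch: `char_at()` and `set_char_at()` are hypothetical helper names, not part of PHP, and the out-of-range handling (returning null, or leaving the string untouched) is an assumption standing in for the warning mentioned above.

```php
<?php
// Hypothetical helpers approximating the proposed negative string offsets.

// Read one byte; negative offsets count back from the end of the string.
function char_at(string $str, int $offset): ?string
{
    $len = strlen($str);
    if ($offset < 0) {
        $offset += $len;          // -1 becomes the last index
    }
    if ($offset < 0 || $offset >= $len) {
        return null;              // out of range, e.g. -6 for "Hallo"
    }
    return $str[$offset];
}

// Write one byte at a (possibly negative) offset and return the new string.
function set_char_at(string $str, int $offset, string $char): string
{
    $len = strlen($str);
    if ($offset < 0) {
        $offset += $len;
    }
    if ($offset < 0 || $offset >= $len || $char === '') {
        return $str;              // assumption: ignore out-of-range writes
    }
    $str[$offset] = $char[0];
    return $str;
}

$str = "Hallo";
$str = set_char_at($str, -1, '0');
$str = set_char_at($str, -4, '4');
echo $str, "\n";                  // prints "H4ll0"
```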

Would be glad to see this in 5.4

Sounds like a good addition to me! For ArrayAccess, would this calculate
the "correct" index so that current implementations of ArrayAccess don't
have to be changed?

cheers,
Derick

--
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug

14 years ago by Etienne Kneuss

Hi,

Sounds like a good addition to me! For ArrayAccess, would this calculate
the "correct" index so that current implementations of ArrayAccess don't
have to be changed?

Do you mean ArrayObject? ArrayAccess is the interface.
Regardless, I don't believe it makes sense to change the semantics of
those indexes for arrays, since arrays can define negative indexes.
i.e. $a = array(-1 => "foo", 2 => "bar"); $a[-1] should really be
"foo", and not "bar".
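A small sketch of the point about arrays: negative integers are ordinary keys, so `$a[-1]` is a plain key lookup and has nothing to do with position. The ArrayObject part assumes the SPL class simply receives the key as written, which is how it behaves as far as I know.

```php
<?php
// Negative integers are ordinary array keys in PHP.
$a = array(-1 => "foo", 2 => "bar");

var_dump($a[-1]);                 // looks up key -1, not "the last element"

// The last element has to be asked for explicitly.
$values = array_values($a);
var_dump($values[count($values) - 1]);

// ArrayObject wraps the same array and receives the key unchanged.
$obj = new ArrayObject($a);
var_dump($obj[-1]);
```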

This looks useful for strings though!

--
Etienne Kneuss
http://www.colder.ch

14 years ago by johannes@schlueters.de

Do you mean ArrayObject? ArrayAccess is the interface.
Regardless, I don't believe it makes sense to change the semantics of
those indexes for arrays, since arrays can define negative indexes.
i.e. $a = array(-1 => "foo", 2 => "bar"); $a[-1] should really be
"foo", and not "bar".

This clearly shows the inconsistency this brings. Maybe $var{$offset}
should be clearly deprecated for arrays and $var[$offset] for strings as
in PHP they work differently.

johannes

14 years ago by Robert Eisele

I would not consider this for arrays and objects either. If we had real
arrays, this would make sense, but they are hash tables (HTs), which also
explains why -1 is an ordinary element and not the end of the linked list
behind the hash table.

14 years ago by johannes@schlueters.de

I would not consider this for arrays and objects either. If we had real
arrays, this would make sense, but they are hash tables (HTs), which also
explains why -1 is an ordinary element and not the end of the linked list
behind the hash table.

Yes. So having this in the current form accepted means that

$a[-1];

can have two meanings:

1) Get the last item (byte in a string)
2) Get item `-1` (in an array)

Which are two different things.

Currently we treat

$a{$o} and $a[$o]

as equal. My suggestion was to split this up to avoid the conflict from
above. I didn't suggest adding support for $a[-1] as last element for
arrays, I know quite well why this won't make sense.

johannes

14 years ago by Robert Eisele

2011/6/20 Johannes Schlüter johannes@schlueters.de

I would not consider this for arrays and objects either. If we had real
arrays, this would make sense, but they are hash tables (HTs), which also
explains why -1 is an ordinary element and not the end of the linked list
behind the hash table.

Yes. So having this in the current form accepted means that

$a[-1];

can have two meanings:

Get the last item (byte in a string)

Get item -1 (in an array)

Yes, sure. But if this feature is documented well, I can't see any
problems with this, especially if the trend goes towards a more
typed language where the user knows about the used data-type.

Which are two different things.

Currently we treat

$a{$o} and $a[$o]

as equal. My suggestion was to split this up to avoid the conflict from
above. I didn't suggest adding support for $a[-1] as last element for
arrays, I know quite well why this won't make sense.

I know about the equality of the two bracket forms. But I read somewhere
that the trend goes towards [] - and maybe it was something from you.

14 years ago by Todd Ruth

Adding to John Crenshaw's list of reasons not to implicitly
treat strings as arrays in foreach loops... Please keep in
mind the following valid code:

$s = 'hello';
foreach ((array)$s as $x) {
    var_dump($x);
}

The result is:
string(5) "hello"

That behavior can be handy. Hopefully, a BC break wouldn't
occur if any of the string features currently being discussed
are implemented. Without a BC break, having strings implicitly
be treated as arrays in foreach loops will seem very strange
in cases like the above.

Iterators are nice. Having a "text_string_to_array" function
would also be fine. For example:

$s = 'hello';
foreach (text_string_to_array($s) as $x) {
    var_dump($x);
}

The result would be:
string(1) "h"
string(1) "e"
string(1) "l"
string(1) "l"
string(1) "o"

I don't know enough about the internals to say for sure, but
perhaps text_string_to_array() could be implemented as creating
a reference to the string that has a flag set that causes
it to be allowed to be treated as an array. (A full conversion
might be needed if it were written to. For example,
$a = text_string_to_array($s); $a[0] = 5; )

Todd

14 years ago by Stas Malyshev

Hi!

Iterators are nice. Having a "text_string_to_array" function
would also be fine. For example:

$s = 'hello';
foreach (text_string_to_array($s) as $x) {
    var_dump($x);
}

text_to_array($s) == str_split($s, 1)

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

14 years ago by Anthony Ferrara

text_to_array($s) == str_split($s, 1)

No, because str_split always splits into 1 byte chunks. text_to_array
would take the character set into account (or that's where the utility
in it would be)...
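The difference is easy to see with a multibyte string. In this sketch the string is written with explicit UTF-8 byte escapes so the result does not depend on the file's encoding. Using `preg_split()` with the `u` modifier as the character-aware splitter is only my assumption of what a text_to_array() could do; no function of that name exists.

```php
<?php
// "Jägermeister" with the "ä" spelled out as its two UTF-8 bytes.
$s = "J\xC3\xA4germeister";

// str_split() works on bytes, so the two bytes of "ä" land in separate chunks.
$bytes = str_split($s, 1);

// An empty pattern with the /u modifier splits between UTF-8 characters.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), " bytes, ", count($chars), " characters\n";
```

For this input the two counts differ by one: 13 bytes against 12 characters.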

14 years ago by John Crenshaw

-----Original Message-----
From: Anthony Ferrara [mailto:ircmaxell@gmail.com]

text_to_array($s) == str_split($s, 1)

No, because str_split always splits into 1 byte chunks. text_to_array
would take the character set into account (or that's where the utility
in it would be)...

I think this does what you want:

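function utf8_str_split($s)
{
    // Every UTF-8 character is a captured delimiter, so each one comes back as
    // its own element. The "s" modifier lets "." match newlines as well, and
    // a limit of -1 means "no limit".
    return preg_split("/(.)/us", $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}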

John Crenshaw
Priacta, Inc.

14 years ago by Derick Rethans

text_to_array($s) == str_split($s, 1)

No, because str_split always splits into 1 byte chunks. text_to_array
would take the character set into account (or that's where the utility
in it would be)...

No, as PHP currently does NOT know about character sets. If you want
character sets, we need Unicode strings like we had in the "PHP6" branch.

Derick

14 years ago by Todd Ruth

Hi!

Iterators are nice. Having a "text_string_to_array" function
would also be fine. For example:

$s = 'hello';
foreach (text_string_to_array($s) as $x) {
    var_dump($x);
}

text_to_array($s) == str_split($s, 1)

Does that have approximately the same performance as marking the
string as being OK to use as an array? For example,

$s = file_get_contents($big_file);
foreach (str_split($s, 1) as $x) {
    f($x);
}

Are there performance issues with the above compared to:

$s = file_get_contents($big_file);
foreach (text_string_to_array($s) as $x) {
    f($x);
}

assuming text_string_to_array could be implemented as marking
the string OK to use as an array.

Again, I don't know enough about the internals. I'm just imagining
a significant difference for very long strings between:
$a1 = text_to_array('hello');
and
$a2 = array('h','e','l','l','o');

$a1 and $a2 could act identically until a set occurred. For example,
"$a1['key'] = 5;" would first trigger $a1 becoming just like $a2 so
that the set could take place.

Any string that has not been hit with text_string_to_array would lead
to all the usual error messages some of us know and love and
any string that has been hit with text_string_to_array would allow all
the fancy features some people are seeking. I'm trying to find a
way to please the people that want strings to act like arrays without
ruining the day for those of us who are glad strings don't act like
arrays.
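One way to get close to this without an internals change is a generator: it walks the string in place and never builds the full array, whereas str_split() materialises every element up front. This is only a userland sketch. `iterate_bytes()` is a made-up name, generators need PHP 5.5 or later, and I have not measured the actual performance difference.

```php
<?php
// Hypothetical helper: yield each byte of $s lazily, keyed by its offset.
function iterate_bytes($s)
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        yield $i => $s[$i];
    }
}

$s = 'hello';
foreach (iterate_bytes($s) as $x) {
    var_dump($x);
}
```

Unlike the "marked string" idea, the generator is read-only; writing to an element would still need a real array.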

Todd

14 years ago by Robert Eisele

I really like the ideas shared here. It is worth considering whether
array functions should also work with strings. Maybe that would be the way
to go, but I'm more excited about the OOP implementation of TextIterator and
ByteIterator, which solves the whole problem at once (and is easier to
implement, as mentioned by Stas). As Jonathan said, database results with a
certain encoding could be iterated, too. The only way to work around the
text/byte problem would be to prefix >EVERY< string with 1-2 bytes of
"string-type" information, or to add a type flag to the zval structure.
Handling everything with zvals instead of objects would have the advantage
that database layers like mysqlnd could write the database encoding directly
into the zval, and the user would not need to decide which encoding is used.

A new casting operator (binary) could then cast the string to a 1-byte
array. But this is syntactic sugar over the OOP implementations; I don't know
which one is the better choice.

For example:

$utf8_string = "Jägermeister"; // information about UTF-8 is stored in the zval

foreach ($utf8_string as $k => $v) // would iterate in text mode

foreach ((binary)$utf8_string as $k => $v) // would iterate in byte mode

over this:
$utf8_obj = new ByteIterator("Jägermeister");

foreach ($utf8_obj as $k => $v)

foreach ($utf8_obj->toText() as $k => $v)

I think the first one is easier and would be nicer to average developers
(and lazy programmers like me ;o) )
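For comparison, the OOP variant can be sketched in userland today. The class below is purely hypothetical (nothing called ByteIterator ships with PHP), and its toText() assumes the data is valid UTF-8, which a real implementation would have to be told rather than guess.

```php
<?php
// Hypothetical class sketching the ByteIterator idea.
class ByteIterator implements IteratorAggregate
{
    private $data;

    public function __construct($data)
    {
        $this->data = $data;
    }

    // Default iteration: one element per byte.
    public function getIterator(): Iterator
    {
        return new ArrayIterator(str_split($this->data, 1));
    }

    // Text mode: one element per character, assuming valid UTF-8 input.
    public function toText()
    {
        return preg_split('//u', $this->data, -1, PREG_SPLIT_NO_EMPTY);
    }
}

// "Jägermeister" with the "ä" written as explicit UTF-8 bytes.
$utf8_obj = new ByteIterator("J\xC3\xA4germeister");

foreach ($utf8_obj as $k => $v) {
    // $v is a single byte here
}

foreach ($utf8_obj->toText() as $k => $v) {
    // $v is a whole character here
}
```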

Todd, I like neither str_split() nor text_string_to_array(). Sure,
str_split() could be optimized to return a different, more optimized result
inside of foreach(), but I would rather use one of the implementations
mentioned above.

14 years ago by Tomas Kuliavas

2011.06.20 21:38 Robert Eisele wrote:

For example:

$utf8_string = "Jägermeister"; // information about UTF-8 is stored in the zval

foreach ($utf8_string as $k => $v) // would iterate in text mode

foreach ((binary)$utf8_string as $k => $v) // would iterate in byte mode

over this:
$utf8_obj = new ByteIterator("Jägermeister");

foreach ($utf8_obj as $k => $v)

foreach ($utf8_obj->toText() as $k => $v)

I think the first one is easier and would be nicer to average developers
(and lazy programmers like me ;o) )

You assume that the string is in UTF-8. It could be some iso-8859-x,
iso-2022-xx, utf-7, utf-16 or any other multibyte character set.

If you want to parse a string character by character, use mb_substr() and
mb_strlen(), set the charset correctly, and make sure that your string
really is in that character set, in case the PHP bug about inconsistent
character position calculation is still unfixed.
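A sketch of that advice, assuming the mbstring extension is loaded. `chars_in_charset()` is a hypothetical helper name; the charset is passed explicitly rather than taken from the configured default, and both sample strings are written as raw byte escapes so the file's own encoding does not matter.

```php
<?php
// Hypothetical helper: walk a string character by character with mbstring.
function chars_in_charset($s, $charset)
{
    $chars = array();
    $len = mb_strlen($s, $charset);
    for ($i = 0; $i < $len; $i++) {
        $chars[] = mb_substr($s, $i, 1, $charset);
    }
    return $chars;
}

// The same word in two encodings; each is only read correctly
// when the matching charset is named.
$utf8   = "J\xC3\xA4germeister";   // UTF-8 bytes
$latin1 = "J\xE4germeister";       // ISO-8859-1 bytes

echo count(chars_in_charset($utf8, 'UTF-8')), "\n";
echo count(chars_in_charset($latin1, 'ISO-8859-1')), "\n";
```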

--
Tomas

14 years ago by johannes@schlueters.de

I really like the ideas shared here. It is worth considering whether
array functions should also work with strings. Maybe that would be the way
to go, but I'm more excited about the OOP implementation of TextIterator and
ByteIterator, which solves the whole problem at once (and is easier to
implement, as mentioned by Stas). As Jonathan said, database results with a
certain encoding could be iterated, too. The only way to work around the
text/byte problem would be to prefix >EVERY< string with 1-2 bytes of
"string-type" information, or to add a type flag to the zval structure.
Handling everything with zvals instead of objects would have the advantage
that database layers like mysqlnd could write the database encoding directly
into the zval, and the user would not need to decide which encoding is used.

Welcome back to the failed PHP 6 Unicode project. ;-)
(although we didn't store the original encoding but converted to UTF-16,
which prevents random/strange conversions in other places when mixing
encodings)

johannes

14 years ago by Robert Eisele

And what actually failed? The idea seems straightforward.

Robert

14 years ago by Ferenc Kovacs

2011/6/21 Robert Eisele robert@xarg.org

And what actually failed? The idea seems straightforward.

Robert

http://www.slideshare.net/andreizm/the-good-the-bad-and-the-ugly-what-happened-to-unicode-and-php-6

To my understanding: in retrospect UTF-16 wasn't the best idea, since it
caused more conversions than had seemed necessary beforehand; many of the
core devs lacked the vision and/or the technical knowledge about Unicode;
and the adoption of support for Unicode strings was much slower than
expected.

Tyrael

14 years ago by Stas Malyshev

Hi!

2011/6/21 Robert Eisele robert@xarg.org
And what actually failed? The idea seems straightforward.

Robert
http://www.slideshare.net/andreizm/the-good-the-bad-and-the-ugly-what-happened-to-unicode-and-php-6

Also you may want to read this:
http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

to understand why "the idea" is not as straightforward as it seems. Yes,
it's about Perl and UTF-8, but gives some impression about the number of
issues that need to be handled. There are many PHP-specific ones on top
of that (think databases, streams, filesystems, etc.) which would be
expected to work out of the box if we declare Unicode support.

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

14 years ago by johannes@schlueters.de

2011/6/20 Johannes Schlüter johannes@schlueters.de

Yes. So having this in the current form accepted means that

$a[-1];

can have two meanings:

Get the last item (byte in a string)

Get item -1 (in an array)

Yes, sure. But if this feature is documented well, I can't see any
problems with this, especially if the trend goes towards a more
typed language where the user knows about the used data-type.

I consider having the exact same syntax for two quite different operations
bad. And I don't buy the "more typed language" argument.

What does
echo my_cool_function()[-1];
do?
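To make the question concrete: with negative string offsets in place (PHP 7.1 and later have them, as far as I know), the same expression gives different results depending on what the function happens to return. The body of `my_cool_function()` below is of course a stand-in, with a flag added only so both cases can be shown.

```php
<?php
// Stand-in for a function whose return type the caller cannot see.
function my_cool_function($as_array)
{
    if ($as_array) {
        return array(-1 => "foo", 2 => "bar");
    }
    return "foobar";
}

// Same syntax, two meanings:
echo my_cool_function(true)[-1], "\n";   // array: the element stored under key -1
echo my_cool_function(false)[-1], "\n";  // string: the last byte (needs PHP 7.1+)
```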

Which are two different things.

Currently we treat

$a{$o} and $a[$o]

as equal. My suggestion was to split this up to avoid the conflict from
above. I didn't suggest adding support for $a[-1] as last element for
arrays, I know quite well why this won't make sense.

I know about the equality of the two bracket forms. But I read somewhere
that the trend goes towards [] - and maybe it was something from you.

There's no clear trend on this. It's a back and forth. Always depends on
the features we think about.

johannes

14 years ago by Felipe Pena

Hi,

I like this one, +1.

--
Regards,
Felipe Pena

14 years ago by Jordi Boggiano

While this in itself is a good thing, I'd prefer to wait some more and
get a well-thought-through, full-fledged solution supporting ranges, e.g.
$str[-1:2] or $str[-1,2].

I believe there were talks of such syntax a few years ago, maybe using
{} instead of []. I mean, right now both [] and {} seem to work equally
on strings and arrays, but changing {} to make it behave more like
substr/array_slice might be a viable BC break (for the negative numbers
that might exist in arrays that is).

Cheers

--
Jordi Boggiano
@seldaek - http://nelm.io/jordi

14 years ago by Robert Eisele

I would push this out in two steps: first negative string offsets, and later
range/slice support for objects and strings. Objects would need a new magic
method, e.g. "__slice()"; strings need a substr()-like interface. I think
both are accessed the same way, but they are different. The slice support is
furthermore much more comprehensive and needs more testing and so on.

BTW: I can dimly remember that {} vs [] was already decided in favor of []
for string access.

Robert

14 years ago by Ilia Alshanetsky

+1, seems useful.

14 years ago by Stas Malyshev

Hi!

Negative string offsets have long been on my wish list, and I have been
running an implementation of them in my own PHP build for some time. They
work the same way as negative offsets in substr(), but avoid the function
call and are much neater when a single character has to be extracted:

$str = "Hallo";

Sounds OK, but what would happen if I do $str[-10] = '?'; ?

Would be glad to see this in 5.4

For that you'll need an RFC with an attached patch ready quite soon.

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

14 years ago by Robert Eisele

2011/6/20 Stas Malyshev smalyshev@sugarcrm.com

Hi!

Negative string offsets have long been on my wish list, and I have been
running an implementation of them in my own PHP build for some time. They
work the same way as negative offsets in substr(), but avoid the function
call and are much neater when a single character has to be extracted:

$str = "Hallo";

Sounds OK, but what would happen if I do $str[-10] = '?'; ?

As I wrote:

If -6 is used as offset, the old warning is displayed because it's the first
undefined negative offset.

Would be glad to see this in 5.4

For that you'll need an RFC with an attached patch ready quite soon.

I'll attach a patch in 2 days (still have to wait for the new power cable of
my macbook)
