"Reader" as alternative to Iterator

8 years ago by Andreas Hennings

Hello internals,
(this is my first email to this list, hopefully I'm doing ok.)

Background / motivation:

Currently in PHP we have an interface "Iterator", and a final class
"Generator" (and others) that implement it.

Using an iterator in foreach () is straightforward:
foreach ($iterator as $key => $value) {..}

However, if we want to iterate only a portion and then continue elsewhere
at the position where we stopped, we need to do something like this:

for ($iterator->rewind(); $iterator->valid(); $iterator->next()) {
    $value = $iterator->current();
    [..]
}

This is unpleasantly verbose, and also adds performance overhead due to
additional function calls.

Also, manually writing an iterator is quite painful.

I sometimes implement "readers" that can be used like this (*):

// Gets a reader at position zero.
$reader = $readerProvider->getReader();
while (FALSE !== $value = $reader->read()) {
    [..]
}

(*) Note that I am using FALSE as an equivalent for "end of data". Of
course it would be nice if we had a dedicated constant for this, that
does not constrain the range of possible values of the iterator.

Such readers are much easier to write than iterators.
However, there is no native support for foreach(), and for generator syntax
with yield.

Adapters from Iterator to Reader and vice versa are possible to write.
However, such userland adapters add additional performance overhead: One
call to ->read() will trigger one call to ->valid(), one to ->current(),
one to ->next().

Proposal:

Establish a new interface in core, "Reader" or "ReaderInterface" (*).
This interface has only one method, "->read()".
The existing interface Iterator will remain unchanged.

(*) I am open to other naming suggestions. In fact in my own projects I
called this thing "StreamInterface", and distinguish between
"ObjectStreamInterface", "RowStreamInterface" etc. Currently I think
"Reader" might be more suitable.

Let the final class "Generator", and possibly other native iterators,
implement Reader in addition to Iterator.

Optionally, add an interface "ReaderAggregate", or
"ReaderAggregateInterface", or "ReaderProvider". This interface has only
one method, "getReader()".

Let ReaderAggregate extend Traversable, and add foreach() support.
The key will simply be a counter.

Open questions:

The naming of "Reader", "ReaderAggregate", "->read()".
Which return value to use for "end of data", so that FALSE would become a
valid value.

I currently don't see a better option than FALSE.

Andreas Hennings
(https://github.com/donquixote)
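
For readers following along, the proposed interface and the FALSE-terminated loop can be sketched in userland as below. This is only an illustration under assumptions: the interface is not part of PHP core, and the ArrayReader class is a hypothetical example that does not appear in the original mail.

```php
<?php
// Hypothetical userland version of the proposed interface; not part of PHP core.
interface Reader
{
    /** @return mixed The next value, or false once the data is exhausted. */
    public function read();
}

// A minimal reader over an in-memory list, for illustration only.
final class ArrayReader implements Reader
{
    private $values;
    private $position = 0;

    public function __construct(array $values)
    {
        $this->values = array_values($values);
    }

    public function read()
    {
        if ($this->position >= count($this->values)) {
            return false; // end of data
        }
        return $this->values[$this->position++];
    }
}

$reader = new ArrayReader(['a', 'b', 'c']);
$seen = [];
while (false !== $value = $reader->read()) {
    $seen[] = $value;
}
```

Note that such a reader cannot carry FALSE as a regular value, which is the limitation raised under "Open questions".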

8 years ago by Sara Golemon

Hello internals,
(this is my first email to this list, hopefully I'm doing ok.)

Welcome to php-internals!

Establish a new interface in core, "Reader" or "ReaderInterface" (*).
This interface has only one method, "->read()".
The existing interface Iterator will remain unchanged.

Rather than introduce a new kind of iteration mechanism, how about
just creating a library in userspace which provides a proxy interface:

class Reader implements Iterator {
    protected $it;
    public function __construct(iterable $it) { $this->it = $it; }
    public function current/key/next/rewind/valid() {
        $this->it->current/key/next/rewind/valid();
    }
    public function read() { /* What a reader should be */ }
}

Then you can use any already extant iterable via:
$reader = new Reader($iterable);
This could live in a composer/packagist library and work on already
released versions of PHP.
Or is there something I'm missing?

-Sara
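
A runnable take on such a userland adapter might look like the sketch below. The class name IteratorReaderAdapter and the FALSE-at-end convention are assumptions for illustration; the shorthand in the mail above does not specify how read() would be implemented. Each read() call here costs one valid(), one current() and one next() on the wrapped iterator.

```php
<?php
// Sketch of a userland adapter from Iterator to a read()-style API.
// The class name and the false-terminated convention are assumptions.
final class IteratorReaderAdapter
{
    private $iterator;
    private $started = false;

    public function __construct(Iterator $iterator)
    {
        $this->iterator = $iterator;
    }

    /** Returns the next value, or false at end of data. */
    public function read()
    {
        if (!$this->started) {
            $this->iterator->rewind();
            $this->started = true;
        }
        if (!$this->iterator->valid()) {
            return false;
        }
        $value = $this->iterator->current();
        $this->iterator->next();
        return $value;
    }
}

function generate()
{
    yield 'a';
    yield 'b';
    yield 'c';
}

$reader = new IteratorReaderAdapter(generate());
$out = [];
while (false !== $v = $reader->read()) {
    $out[] = $v;
}
```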

8 years ago by Andreas Hennings

Well, on a current project I made an attempt to write different
adapters in userland.
I finally decided that the clutter is not worth it.
So for this project I wrote everything as "readers", and not as iterators.

With a native solution, one could do this:

function generate() {
    yield 'a';
    yield 'b';
    yield 'c';
}

$it = generate();

while (FALSE !== $v = $it->read()) {
    print $v . "\n";
}

With a userland adapter, the code would read like this:

function generate() {
    yield 'a';
    yield 'b';
    yield 'c';
}

$it = generate();

// This is the added line.
$reader = new IteratorReaderAdapter($it);

while (FALSE !== $v = $reader->read()) {
    print $v . "\n";
}

This does add clutter and performance overhead.
In this example I'd say this is acceptable / survivable.

In the project I was working on, I have multiple layers of "readers":

One layer to read raw data from a CSV file.
One layer to re-key the rows by column labels.
One layer to turn each row into an object, with structured
properties instead of just string cells.

With the "reader" approach, for each layer I would have one "reader
provider" class and one "reader" class per layer.
(In fact I have much more, but let's keep it simple)
Generator syntax would allow having only one class per layer.
But with the additional adapter, we again increase the number of classes.
(I wanted dedicated reader classes for different return types, e.g.
one "RowReader", one "AssocReader", one "ObjectReader". So here I
would need one adapter class per type. But let's focus on the simple
case, where you can use the same reader class.)

In addition to more classes and more function calls (performance), the
adapters also make stack traces heavier to look at.

Overall, for this project, I decided that it was not worth it.

The main purpose of the generator syntax is to simplify the code. The
need for userland adapters defeated this purpose.
Native readers would have made generators worthwhile for my use case.

The idea of "dedicated reader types" e.g. Reader<MyClass> could be
added to "Open questions"..
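
To make the three layers concrete, here is a rough sketch of how they could be written with generator syntax alone. The function names and the sample CSV text are hypothetical, plain functions stand in for the per-layer classes, and a string stands in for the CSV file.

```php
<?php
// Layer 1: raw rows from CSV text (a string stands in for a file here).
function readRows(string $csv): Generator
{
    foreach (explode("\n", trim($csv)) as $line) {
        yield str_getcsv($line, ',', '"', '\\');
    }
}

// Layer 2: re-key each row by the column labels taken from the first row.
function rekeyRows(iterable $rows): Generator
{
    $labels = null;
    foreach ($rows as $row) {
        if ($labels === null) {
            $labels = $row; // header row, not yielded
            continue;
        }
        yield array_combine($labels, $row);
    }
}

// Layer 3: turn each associative row into an object.
function toObjects(iterable $assocRows): Generator
{
    foreach ($assocRows as $assoc) {
        yield (object) $assoc;
    }
}

$csv = "id,name\n1,Ada\n2,Linus\n";
$people = iterator_to_array(toObjects(rekeyRows(readRows($csv))), false);
```

Consuming this pipeline with a read()-style loop instead of foreach is where the adapter discussed above would come in.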

8 years ago by Niklas Keller

2017-07-03 4:49 GMT+02:00 Andreas Hennings andreas@dqxtech.net:

Hey Andreas,

what you're trying to do here seems to be premature optimization. While you
save a bunch of method calls, your I/O will be the actual bottleneck there.
It's entirely fine to implement such logic in userland.

Amp has a similar interface for its streams, but those have only
string|null as types. If you want to allow all values, you either need a
second method or need to wrap all values in an object.

http://amphp.org/byte-stream/#inputstream +
http://amphp.org/amp/iterators/#iterator-consumption

Regards, Niklas

8 years ago by Andreas Hennings

Hey Andreas,

what you're trying to do here seems to be premature optimization. While you
save a bunch of method calls, your I/O will be the actual bottleneck there.
It's entirely fine to implement such logic in userland.

I will let this stand unchallenged, until I have some reproducible data..

Amp has a similar interface for its streams, but those have only string|null
as types. If you want to allow all values, you either need a second method
or need to wrap all values in an object.

http://amphp.org/byte-stream/#inputstream +
http://amphp.org/amp/iterators/#iterator-consumption

This library looks interesting.
It seems to do a lot more than I currently need, with its concurrency approach.
I am a bit puzzled by the yield keyword in this code:

while (($chunk = yield $inputStream->read()) !== null) {
    $buffer .= $chunk;
}

In my experience with generators so far, you either use yield for
sending or for receiving. So either "yield $value;" or "$value =
yield;" In this case it seems to do both.
I assume this is to achieve the concurrency.

(This is not an argument for or against anything, just an observation)

8 years ago by Niklas Keller

2017-07-03 17:50 GMT+02:00 Andreas Hennings andreas@dqxtech.net:

Hey Andreas,

what you're trying to do here seems to be premature optimization. While you
save a bunch of method calls, your I/O will be the actual bottleneck there.
It's entirely fine to implement such logic in userland.

I will let this stand unchallenged, until I have some reproducible data..

Amp has a similar interface for its streams, but those have only string|null
as types. If you want to allow all values, you either need a second method
or need to wrap all values in an object.

http://amphp.org/byte-stream/#inputstream +
http://amphp.org/amp/iterators/#iterator-consumption

This library looks interesting.
It seems to do a lot more than I currently need, with its concurrency approach.
I am a bit puzzled by the yield keyword in this code:

while (($chunk = yield $inputStream->read()) !== null) {
    $buffer .= $chunk;
}

In my experience with generators so far, you either use yield for
sending or for receiving. So either "yield $value;" or "$value =
yield;" In this case it seems to do both.

Indeed, it can do both at the same time.

I assume this is to achieve the concurrency.

Not exactly, it "outputs" a promise and "inputs" the resolution value or
exception once that promise resolves.

Regards, Niklas
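
The "both at the same time" behaviour of yield can be seen with plain generators, without any promise library. The sketch below uses only the built-in Generator methods current(), send() and getReturn(); the function name echoBack is made up for this example, and it does not attempt to reproduce Amp's API.

```php
<?php
// A generator that emits a value and receives one at the same yield.
function echoBack(): Generator
{
    $received = [];
    // "yield 'first'" hands 'first' to the caller; the whole expression then
    // evaluates to whatever the caller passes to send().
    $received[] = yield 'first';
    $received[] = yield 'second';
    return $received;
}

$gen = echoBack();
$out1 = $gen->current();     // runs to the first yield
$out2 = $gen->send('A');     // resumes with 'A', runs to the second yield
$gen->send('B');             // resumes with 'B', generator returns
$result = $gen->getReturn(); // values the generator received
```

In a coroutine library, the value going out would be a promise and the value coming back in would be its resolution, as described above.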

8 years ago by Sara Golemon

(I wanted dedicated reader classes for different return types, e.g.
one "RowReader", one "AssocReader", one "ObjectReader". So here I
would need one adapter class per type. But let's focus on the simple
case, where you can use the same reader class.)

You need that anyway. If the current iterator returns one type and
you want to transform that into another type, then you need something
to actually do that. Having a reader interface won't magically know
that you want to change the output type.

If I'm misunderstanding that, and you're saying that the output type
of the original iterator is already different and you somehow need a
different proxy to blindly pass through the different type then... No.
You don't. A single reader adapter will handle whatever type you pass
through it.

the adapters also make stack traces heavier to look at.

That is the only slightly compelling argument I've seen so far. Not
compelling enough IMO.

The main purpose of the generator syntax is to simplify the code. The
need for userland adapters defeated this purpose.

ONE adapter, which lives in a library that you don't touch once it's
written. That doesn't defeat your purposes at all. That is, in fact,
identical to having done the implementation in the core, except when
done in userland you have the opportunity to use it tomorrow rather
than in December, and the ability to fix any bugs immediately rather
than on a release cycle.

Native readers would have made generators worthwhile for my use case.

You've yet to demonstrate that they are not worthwhile.

The idea of "dedicated reader types" e.g. Reader<MyClass> could be
added to "Open questions"..

You have yet to demonstrate the need for dedicated and/or templatized
reader types.

-Sara

8 years ago by Andreas Hennings

Thanks everyone so far for the replies!
I think I need to do some "homework", and come back with benchmarks
and provide real examples.
I remember that the overhead did make a difference in performance, but
I should back that up with real data.

For now just some inline replies.

(I wanted dedicated reader classes for different return types, e.g.
one "RowReader", one "AssocReader", one "ObjectReader". So here I
would need one adapter class per type. But let's focus on the simple
case, where you can use the same reader class.)

You need that anyway. If the current iterator returns one type and
you want to transform that into another type, then you need something
to actually do that. Having a reader interface won't magically know
that you want to change the output type.

If I'm misunderstanding that, and you're saying that the output type
of the original iterator is already different and you somehow need a
different proxy to blindly pass through the different type then... No.
You don't. A single reader adapter will handle whatever type you pass
through it.

Yes, the adapters were just blind proxies.
In my own library I had dedicated reader types like
XmlElementReaderInterface, RowReaderInterface (e.g. for CSV),
AssocReaderInterface, ObjectReaderInterface, each with their own
->read() methods like ->getElement(), ->getRow(), ->getAssoc(),
->getObject().
So I ended up writing a distinct iterator adapter for each reader type.

The different interfaces were helpful to prevent using a reader of the
wrong type.
But we would lose this anyway if this was implemented in core. All the
methods would all collapse into just one method ->read().
Maybe some PhpDoc magic could help the IDE to distinguish the reader
type. Currently my IDE does not support such a thing.

8 years ago by johannes@schlueters.de

Hello internals,
(this is my first email to this list, hopefully I'm doing ok.)

Background / motivation:

Currently in PHP we have an interface "Iterator", and a final class
"Generator"
(and others) that implement it.

Using an iterator in foreach () is straightforward:
foreach ($iterator as $key => $value) {..}

However, if we want to iterate only a portion and then continue
elsewhere at the position where we stopped, we need to do something
like this:

for ($iterator->rewind(); $iterator->valid(); $iterator->next()) {
    $value = $iterator->current();
    [..]
}

This is unpleasantly verbose, and also adds performance overhead due
to additional function calls.

Wouldn't SPL's NoRewindIterator be enough?

$nit = new NoRewindIterator($it);

foreach ($nit as $row) {
    break;
}
foreach ($nit as $row) {
    // continues same iteration
}

Also, manually writing an iterator is quite painful.

I sometimes implement "readers" that can be used like this (*):

// Gets a reader at position zero.
$reader = $readerProvider->getReader();
while (FALSE !== $value = $reader->read()) {
    [..]
}

(*) Note that I am using FALSE as an equivalent for "end of data". Of
course it would be nice if we had a dedicated constant for this, that
does not constrain the range of possible values of the iterator.

That distinction is the reason why next() and valid() are different
methods in iterators.

johannes
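
A small self-contained version of the NoRewindIterator idea, using an ArrayIterator as a stand-in for $it, could look like this. One detail worth keeping in mind: break leaves the loop before next() is called, so the element that was current at the break is seen again by the second loop.

```php
<?php
// ArrayIterator is only a stand-in for whatever iterator is being consumed.
$it = new NoRewindIterator(new ArrayIterator(['a', 'b', 'c']));

$first = [];
foreach ($it as $row) {
    if ($row === 'b') {
        break; // leaves the iterator positioned on 'b'
    }
    $first[] = $row;
}

$rest = [];
foreach ($it as $row) {
    // rewind() is a no-op here, so this loop resumes at 'b'.
    $rest[] = $row;
}
```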

8 years ago by Niklas Keller

2017-07-03 17:27 GMT+02:00 Johannes Schlüter johannes@schlueters.de:

(*) Note that I am using FALSE as an equivalent for "end of data". Of
course it would be nice if we had a dedicated constant for this, that
does not constrain the range of possible values of the iterator.

That distinction is the reason why next() and valid() are different
methods in iterators.

Not really, Iterator::next() returns void, so could as well return bool.

Regards, Niklas

8 years ago by johannes@schlueters.de

That distinction is the reason why next() and valid() are different
methods in iterators.

Not really, Iterator::next() returns void, so could as well return bool.

Well, that story is a bit longer and I cut it short. Let's assume we
remove valid and use next's return value. Then we don't know if the
first element exists or not, as next is only called after the first
iteration. An alternative might be using only current() but then we
need a special, magic, return value to mark the end or references.
Both are bad.

johannes

8 years ago by Andreas Hennings

On Mon, Jul 3, 2017 at 5:53 PM, Johannes Schlüter
johannes@schlueters.de wrote:

That distinction is the reason why next() and valid() are different
methods in iterators.

I would rather say this is the reason why current() and valid() are
different methods.
A combination of ->current() and next() would not cause problems.

Not really, Iterator::next() returns void, so could as well return bool.

Well, that story is a bit longer and I cut it short. Let's assume we
remove valid and use next's return value. Then we don't know if the
first element exists or not, as next is only called after the first
iteration. An alternative might be using only current() but then we
need a special, magic, return value to mark the end or references.
Both are bad.

The proposed ->read() method is more or less the same as ->current()
plus ->next(), if ->current() always returns FALSE on end-of-data.
An object that has both ->read() and ->valid() would remove the need
for a special magic return value.

But in many many cases, having a magic return value FALSE is totally acceptable.
Think of fgetcsv(), fread(), or PDOStatement::fetch().

So if we wanted, we could have one interface "SimpleReader" or
"FalseTerminatedReader" with only the ->read() method, and another
interface "Reader" with both ->read() and ->valid().

(Sometimes I would like to have a new data type "unique symbol" for
constants that are not strings or integers. So we could say "while
(EOF !== $value = $reader->read());" but this is another topic)
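
Until something like a "unique symbol" type exists, one userland approximation is a singleton sentinel object compared by identity. The sketch below is only an assumption about how that could look; EndOfData and SentinelReader are hypothetical names, and nothing like them is proposed for core in this thread.

```php
<?php
// Hypothetical sentinel: a unique object that no data source produces by accident.
final class EndOfData
{
    private static $instance = null;

    private function __construct() {}

    public static function get(): self
    {
        if (self::$instance === null) {
            self::$instance = new self();
        }
        return self::$instance;
    }
}

// Illustrative reader that signals exhaustion with the sentinel instead of false.
final class SentinelReader
{
    private $values;
    private $position = 0;

    public function __construct(array $values)
    {
        $this->values = array_values($values);
    }

    public function read()
    {
        if ($this->position >= count($this->values)) {
            return EndOfData::get();
        }
        return $this->values[$this->position++];
    }
}

$eof = EndOfData::get();
$reader = new SentinelReader([true, false, null, 0]);
$seen = [];
while ($eof !== ($value = $reader->read())) {
    $seen[] = $value; // false and null pass through as ordinary values
}
```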


8 years ago by Andreas Hennings

On Mon, Jul 3, 2017 at 5:53 PM, Johannes Schlüter
johannes@schlueters.de wrote:

That distinction is the reason why next() and valid() are different
methods in iterators.

I would rather say this is the reason why current() and valid() are
different methods.
A combination of ->current() and next() would not cause problems.

Not really, Iterator::next() returns void, so could as well return bool.

Just so we are on the same page:
Changing the behavior of Iterator interface would be a BC break.
Introducing a new interface, and letting Generator implement it, would
not break anything.
(except for the possible name clash, if an interface with the same
name already exists in userland outside of a namespace)

I assume we already agree on this, and this was more of a thought experiment.

8 years ago by Andreas Hennings

On Mon, Jul 3, 2017 at 5:27 PM, Johannes Schlüter
johannes@schlueters.de wrote:

Wouldn't SPL's NoRewindIterator be enough?

$nit = new NoRewindIterator($it);

foreach ($nit as $row) {
    break;
}
foreach ($nit as $row) {
    // continues same iteration
}

I had not noticed this class :)

My motivation was to be able to use iterators and readers interchangeably.
Readers are easier to implement as classes.
Iterators have the benefit of the generator syntax.
The idea is to have a library where some stuff is implemented as
reader classes, and some other things are implemented as iterator
classes.

Then for the actual operational code, I need to choose to use either
iterator syntax or reader syntax. So either I need adapters for all
iterators, or I need adapters for all readers.

The NoRewindIterator would make it more viable to work with iterator
syntax on operation level.
So yes, it is a helpful hint. Not sure if it makes me fully happy.

8 years ago by Rowan Collins

My motivation was to be able to use iterators and readers interchangeably.
Readers are easier to implement as classes.
Iterators have the benefit of the generator syntax.
The idea is to have a library where some stuff is implemented as
reader classes, and some other things are implemented as iterator
classes.

Then for the actual operational code, I need to choose to use either
iterator syntax or reader syntax. So either I need adapters for all
iterators, or I need adapters for all readers.

An alternative to adapters in this case might be a Trait that allows you to write the class as a Reader, but turn it into a fully-functioning Iterator with one line of code. If you always call your read method read() you only need a single adapter which calls that at the appropriate times and buffers the value to know if it's reached the end. That adapter could be "pasted" by a Trait into each class where read() is defined, and read() could even be private.

Regards,

--
Rowan Collins
[IMSoP]
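
A possible shape for such a trait is sketched below. It assumes PHP 8.0 or later (for the mixed return type), assumes read() returns false at end of data, and declares read() as protected rather than private. ReaderIteratorTrait and Countdown are hypothetical names used only for this illustration.

```php
<?php
// Hypothetical trait: turns any class with a read() method into an Iterator.
// Assumes read() returns false at end of data and that the reader cannot
// restart, so rewind() only has an effect before the first element is read.
trait ReaderIteratorTrait
{
    private $buffer = false;
    private $index = -1;

    abstract protected function read();

    public function rewind(): void
    {
        if ($this->index === -1) {
            $this->next(); // load the first element into the buffer
        }
    }

    public function valid(): bool
    {
        return $this->buffer !== false;
    }

    public function current(): mixed
    {
        return $this->buffer;
    }

    public function key(): mixed
    {
        return $this->index;
    }

    public function next(): void
    {
        $this->buffer = $this->read();
        $this->index++;
    }
}

// Written as a reader; becomes an Iterator with one "use" line.
final class Countdown implements Iterator
{
    use ReaderIteratorTrait;

    private $n;

    public function __construct(int $n)
    {
        $this->n = $n;
    }

    protected function read()
    {
        return $this->n > 0 ? $this->n-- : false;
    }
}

$values = iterator_to_array(new Countdown(3));
```

The buffered value is what lets valid() answer before current() is called, at the cost of reading one element ahead.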

8 years ago by Davey Shafik

I believe the correct solution to this is an OuterIterator — an
OuterIterator contains the logic to constrain the loop, for example, we
have the LimitIterator, FilterIterator, RegexIterator, etc.

See this example:
http://php.net/manual/en/class.limititerator.php#example-4473

You can build your own OuterIterators, which implement whatever logic your
Reader should have.

OuterIterators are also composable: you can stack them, though of course
there are performance considerations here.

I would say that any alternative to OuterIterators should have a clear and
demonstrable benefit in both developer experience, and performance before
it should be considered.

Davey
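
As a brief illustration of the OuterIterator approach, the sketch below wraps an ArrayIterator in SPL's LimitIterator and then stacks a CallbackFilterIterator on top. The sample data and the filter condition are arbitrary choices for this example.

```php
<?php
$source = new ArrayIterator(['a', 'b', 'c', 'd', 'e']);

// LimitIterator is an OuterIterator: it wraps $source and exposes a window of it.
$window = new LimitIterator($source, 1, 3); // skip 1 element, then at most 3

$values = iterator_to_array($window, false);

// OuterIterators can be stacked: filter the same window again.
$filtered = new CallbackFilterIterator($window, function ($value) {
    return $value !== 'c';
});
$filteredValues = iterator_to_array($filtered, false);
```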
