Hi,
I'm grappling with a design flaw I just uncovered in stream filters, and
need some advice on how best to fix it. The problem has existed since the
introduction of stream filters, and has 3 parts. 2 of them can probably
be fixed safely in PHP 5.2+, but I think the third may require an
internal redesign of stream filters, and so would probably have to wait
for PHP 5.3+, even though it is a clear bugfix (Ilia, your opinion is
appreciated on this).
The first part of the bug that I encountered is best described here:
http://bugs.php.net/bug.php?id=46026. However, it is a deeper problem
than this, as the attempt to cache data is dangerous any time a stream
filter is attached to a stream. I should also note that the patch in
this bug contains feature additions that would have to wait for PHP 5.3.
I ran into this problem because I was trying to use stream filters to
read in a bz2-compressed file within a zip archive in the phar
extension. This was failing, and I first tracked the problem down to an
attempt by php_stream_filter_append to read in a bunch of data and cache
it, which caused more stuff to be passed into the bz2 decompress filter
than it could handle, making it barf. After fixing this problem, I ran
into the problem described in the bug above because of
php_stream_fill_read_buffer doing the same thing when I tried to read
the data, because I requested it return 176 decompressed bytes, and so
php_stream_read passed in 176 bytes to the decompress filter. Only 144
of those bytes were actually bz2-compressed data, and so the filter
barfed upon trying to decompress the remaining data (same as bug #46026,
found differently).
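For the record, here is a minimal userland sketch of the failure mode
(the file name and payload are made up for illustration; the real
trigger was a zip entry whose bz2 data is followed by other bytes):
<?php
// Hypothetical reproduction: bz2 data followed by unrelated trailing
// bytes, as in a zip entry. Requires ext/bz2.
$payload = bzcompress(str_repeat('x', 176));
file_put_contents('payload.bin', $payload . 'TRAILING-NON-BZ2-DATA');

$fp = fopen('payload.bin', 'rb');
stream_filter_append($fp, 'bzip2.decompress', STREAM_FILTER_READ);
// php_stream_fill_read_buffer feeds the filter more raw bytes than the
// bz2 stream actually contains, so before the fix the filter errored
// out when libbz2 reported end-of-stream instead of ignoring the excess.
$data = fread($fp, 176);
var_dump(strlen($data));
fclose($fp);
?>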
You can probably tell from my explanation that this is an
extraordinarily complex problem. There are 3 inter-related problems here:
- the bz2 (and zlib) stream filters should stop trying to decompress when
they reach the stream end, regardless of how many bytes they are told to
decompress (easy to fix)
- it is never safe to cache read data when a read stream filter is
appended, as there is no safe way to determine in advance how much of
the stream can be safely filtered (would be easy to fix if it weren't
for #3)
- there is no clear way to request that a certain number of filtered
bytes be returned from a stream, versus how many unfiltered bytes should
be passed into the stream (very hard to fix without a design change)
I need some advice on #3 from the original designers of stream filters
and streams, as well as any experts who have dealt with this kind of
problem in other contexts. In this situation, should we expect stream
filters to always stop filtering once they reach the end of valid input?
Even then, there is the potential that less data will be available
than was passed in. A clear example would be if we requested only 170
bytes. 144 of those bytes would be passed in as the complete compressed
data, and bz2.decompress would decompress all of it to 176 bytes. 170
of those bytes would be returned from php_stream_read, and 6 would have
to be placed in a cache for future reads. Thus, there would need to be
some way of marking the cache as valid because of this logic path:
<?php
$a = fopen('blah.zip', 'rb'); // fopen needs a mode; read-only binary here
fseek($a, 132); // fills read buffer with unfiltered data
stream_filter_append($a, 'bzip2.decompress'); // clears read buffer cache
$b = fread($a, 170); // fills read buffer cache with 6 bytes
fseek($a, 3, SEEK_CUR); // this should seek within the filtered data
                        // read buffer cache
stream_filter_append($a, 'zlib.inflate');
?>
The question is what should happen when we append the second filter
'zlib.inflate' to filter the filtered data? If we clear the read buffer
as we did in the first case, it will result in lost data. So, let's
assume we preserve the read buffer. Then, if we perform:
<?php
$c = fread($a, 7);
?>
and assume the remaining 3 bytes expand to 8 bytes, how should the read
buffer cache be handled? Should the first 3 bytes still be the filtered
bzip2 decompressed data, and the last 3 replaced with the 8 bytes of
decompressed zlib data?
Basically, I am wondering if perhaps we need to implement a read buffer
cache for each stream filter. This could solve our problem, I think.
The data would be stored like so:
stream: 170 bytes of unfiltered data, and a pointer to byte 145 as the
next byte for php_stream_read()
bzip2.decompress filter: 176 bytes of decompressed bzip2 data, and a
pointer to byte 171 as the next byte for php_stream_read()
zlib.inflate filter: 8 bytes of decompressed zlib data, and a pointer to
byte 8 as the next byte for php_stream_read()
This way, we would essentially have a stack of stream data. If the zlib
filter were then removed, we could "back up" to the bzip2 filter and so
on. This would allow proper read cache filling, and remove the weird
ambiguities that are apparent in a filtered stream. I don't think we
would need to worry about backwards compatibility here, as the most
common use case would be unaffected by this change, and the use case it
would fix has never actually worked.
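To make the idea concrete, here is a toy userland model of the proposed
per-filter cache stack (illustrative only; the real implementation would
live in the C streams layer, and the structure names are made up):
<?php
// Toy model of the proposed design: one cache layer per attached
// filter, each with its own filtered data size and read position.
$layers = array(
    array('name' => 'stream',           'bytes' => 170, 'pos' => 144),
    array('name' => 'bzip2.decompress', 'bytes' => 176, 'pos' => 170),
    array('name' => 'zlib.inflate',     'bytes' => 8,   'pos' => 8),
);

// Removing the zlib filter discards only the top layer; the bzip2
// layer's cache and read position are still intact, so reads "back up"
// to it exactly as described above.
array_pop($layers);
$top = end($layers);
echo "reads now resume from {$top['name']} at byte {$top['pos']}\n";
?>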
I haven't got a patch for this yet, but it would be easy to do if the
logic is sound.
Thanks,
Greg
Hello Gregory,
The filters need an input state and an output state. And they need to
respect the fact that the number of requested bytes does not necessarily
mean reading the same amount of data on the other end. In fact, the output
side generally does not know whether less, the same amount, or more input
is to be read. But this can be dealt with inside the filter. However, the
filters should always report the number of input bytes consumed and the
number of output bytes produced independently. Regarding states, we would
be interested in whether reaching EOD on the input side means reaching EOD
on the output side prior to the requested amount, at the requested amount,
or not at all yet (more data available).
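For comparison, the userland filter API already reports the input-side
count separately from the output data; a minimal pass-through sketch:
<?php
// Minimal pass-through user filter: $consumed counts input bytes, while
// the buckets appended to $out carry the output bytes, so the two sides
// are already reported independently at this level.
class counting_passthru extends php_user_filter
{
    function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;       // input-side byte count
            stream_bucket_append($out, $bucket); // output-side data
        }
        return PSFS_PASS_ON;
    }
}
stream_filter_register('counting.passthru', 'counting_passthru');
?>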
Best regards,
Marcus
Hi,
All of this makes sense, but how to implement it? Currently, there is
no clear way to answer questions like:
- how much data did the filter consume, and how much is left over?
- how much data is left in the original stream prior to filtering?
Perhaps this could be implemented by overloading ftell() so that it can
accept a filter resource as well as a stream resource? Also, a new
function could be implemented:
stream_filter_datasize(resource $filter) // I'm open to better names
which would either return the total size of filtered data available, or
false if this is indeterminate. This would allow:
<?php
$startoffile = 123;
$a = fopen('blah.zip', 'rb');
fseek($a, $startoffile);
$filter = stream_filter_append($a, 'bzip2.decompress');
$b = fread($a, 170);
echo ftell($a), ", ", ftell($filter), ", ", stream_filter_datasize($filter), "\n";
// outputs "145, 170, 176"
?>
Greg
Greg,
First off, great job on the analysis of the problem. As far as I can
tell (correct me if I am wrong), the problem lies in extraordinarily
complex use of stream filters, with more than 2 stacked,
such as phar://zip://bz2:// (for example). Since I'd wager that this
is not a common use case, I would prefer to refrain from any major
changes for this in the 5.2 release. Whatever simple solutions we can
implement, such as the fix for #1 (which appears to have already been
committed), are more than welcome; however, my call is that #2 and #3
wait for 5.3, unless a simple and reliable solution is found.
--
Ilia Alshanetsky
Hi Ilia,
Actually, I ran into the problem when using a single stream filter on a
file that contained uncompressed data after the compressed data. When
filling the read buffer, PHP tries to load in an arbitrary number of
bytes. The bz2 and zlib filters were failing because they would
continue attempting to process data after the underlying library (libbz2
or zlib) told them it was done, triggering an error. The
bug in bz2/zlib was fixed today, but the problem remains: PHP is
passing more data into stream filters than it should, as it is never
safe to assume that this is an OK thing to do. The problem of chaining
filters is an example I doubt anyone has actually tried. The same
problem exists if one appends a filter to decompress some data, and then
uses fseek on the filtered data when not all of the data has been
decompressed (which is actually a typical use case and one I ran into
today). To be clear, although the example was inside the phar
extension, it uses a file-based stream and stream_filter_append() with
the bzip2.decompress stream filter to decompress the contents of a file
within a zip archive; the zip:// stream wrapper is not used inside the
phar extension.
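A sketch of that typical use case (file name and offsets hypothetical):
<?php
// Hypothetical layout: a compressed zip entry starting at byte 144.
$fp = fopen('archive.zip', 'rb');
fseek($fp, 144); // position at the start of the compressed entry
stream_filter_append($fp, 'bzip2.decompress', STREAM_FILTER_READ);
$head = fread($fp, 16); // decompress only the beginning of the entry
// Seeking now should move within the *filtered* data, but the stream
// layer's read-cache bookkeeping gets this wrong today.
fseek($fp, 8, SEEK_CUR);
?>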
However, I think it is safe to assume nobody has tried this, as we would
have had a bug report before I ran into it today. So I'm fine with
fixing it in 5.3, unless this new information makes you think it is a
critical fix for 5.2, since it is essentially broken.
Greg
Hi,
Sorry about the top post; I am CC'ing Wez and Sara again in
the tiny hope of a reaction.
It seems the streams layer is effectively unmaintained. Greg seems
to have gotten his fingers in there a bit, but for such a critical
piece I guess it would be good to have at least 1-2 more people
familiar with it. Reviewing this issue would be a good start.
Here's hoping ...
regards,
Lukas
--
Lukas Kahwe Smith
mls@pooteeweet.org
I'm too busy to review anything anytime soon; got lots on for the next
couple of months here at work.
--Wez.
The first part of the bug that I encountered is best described here:
http://bugs.php.net/bug.php?id=46026. However, it is a deeper problem
than this, as the attempt to cache data is dangerous any time a stream
filter is attached to a stream. I should also note that the patch in
this bug contains feature additions that would have to wait for PHP 5.3.
Can you file a bug report for the second part?
Not really sure how to proceed otherwise; I guess we need someone to step
up to help Wez/Sara with maintaining the streams layer.
regards,
Lukas Kahwe Smith
mls@pooteeweet.org
Hi,
The problem is mainly to be able to filter a given amount of bytes, starting
at a given position, and to know when that amount of bytes has been passed
to the filter.
I would propose a new argument to stream_filter_append:
stream_filter_append(stream, filter_name[, max_input_bytes])
so that only max_input_bytes will be passed to the filter. To know when this
amount of bytes has been passed to the filter, I would propose making the
stream act as a slice of the original stream: it returns EOF once
max_input_bytes have been passed to the filter. Removing the filter clears
the EOF flag and allows reading from the stream again.
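For illustration, usage under this proposal might look like the following
(the extra argument is hypothetical and does not exist in current PHP,
where the third parameter of stream_filter_append is the read/write mode):
<?php
$fp = fopen('blah.zip', 'rb');
fseek($fp, 144); // start of the compressed entry
// Hypothetical third argument: feed the filter at most 144 raw bytes.
$filter = stream_filter_append($fp, 'bzip2.decompress', 144);
$data = stream_get_contents($fp); // EOF once the 144-byte slice is used up
stream_filter_remove($filter);    // clears EOF; raw reads resume here
?>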
Your proposal of a read buffer cache for each filter would help with that,
and would make filters more robust in some use cases.
Regards,
Arnaud