Newsgroups: php.internals
To: internals@lists.php.net
Cc: Gregory Beaver
Date: Mon, 17 Nov 2008 11:34:46 +0100
From: lbarnaud@php.net (Arnaud Le Blanc)
Subject: Re: [PHP-DEV] question on how to solve major stream filter design flaw
References: <48F0E625.9050505@chiaraquartet.net>
In-Reply-To: <48F0E625.9050505@chiaraquartet.net>
Message-ID: <200811171134.46407.lbarnaud@php.net>

Hi,

On Saturday 11 October 2008 19:45:09 Gregory Beaver wrote:
> Hi,
>
> I'm grappling with a design flaw I just uncovered in stream filters, and
> need some advice on how best to fix it. The problem has existed since the
> introduction of stream filters, and has 3 parts. 2 of them can probably
> be fixed safely in PHP 5.2+, but I think the third may require an
> internal redesign of stream filters, and so would probably have to be
> PHP 5.3+, even though it is a clear bugfix (Ilia, your opinion is
> appreciated on this).
>
> The first part of the bug that I encountered is best described here:
> http://bugs.php.net/bug.php?id=46026. However, it is a deeper problem
> than this, as the attempt to cache data is dangerous any time a stream
> filter is attached to a stream. I should also note that the patch in
> this bug contains feature additions that would have to wait for PHP 5.3.
>
> I ran into this problem because I was trying to use stream filters to
> read in a bz2-compressed file within a zip archive in the phar
> extension. This was failing, and I first tracked the problem down to an
> attempt by php_stream_filter_append to read in a bunch of data and cache
> it, which caused more data to be passed into the bz2 decompress filter
> than it could handle, making it barf. After fixing this problem, I ran
> into the problem described in the bug above because
> php_stream_fill_read_buffer was doing the same thing when I tried to
> read the data: I requested that it return 176 decompressed bytes, and so
> php_stream_read passed 176 bytes into the decompress filter. Only 144
> of those bytes were actually bz2-compressed data, and so the filter
> barfed upon trying to decompress the remaining data (same as bug #46026,
> found differently).
>
> You can probably tell from my explanation that this is an
> extraordinarily complex problem. There are 3 inter-related problems
> here:
>
> 1) the bz2 (and zlib) stream filters should stop trying to decompress
> when they reach the end of the compressed stream, regardless of how many
> bytes they are told to decompress (easy to fix)
> 2) it is never safe to cache read data when a read stream filter is
> appended, as there is no safe way to determine in advance how much of
> the stream can be safely filtered (would be easy to fix if it weren't
> for #3)
> 3) there is no clear way to request that a certain number of filtered
> bytes be returned from a stream, versus how many unfiltered bytes should
> be passed into the stream (very hard to fix without a design change)
>
> I need some advice on #3 from the original designers of stream filters
> and streams, as well as any experts who have dealt with this kind of
> problem in other contexts. In this situation, should we expect stream
> filters to always stop filtering when they reach the end of valid input?
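For what it's worth, point 1 is the behaviour that mature decompression
APIs already expose: stop at the logical end of the compressed stream and
report, rather than consume, any trailing bytes. A rough illustration (in
Python rather than PHP, purely because its bz2 module makes the semantics
easy to demonstrate; the payload and byte values are made up):

```python
import bz2

# Hypothetical stand-ins for the numbers in the bug report: "payload" plays
# the 176 decompressed bytes, "compressed" the 144 bytes of bz2 data inside
# the zip, and "trailing" the following zip bytes that are NOT bz2 data.
payload = b"A" * 176
compressed = bz2.compress(payload)
trailing = b"rest-of-the-zip-archive"

decomp = bz2.BZ2Decompressor()
out = decomp.decompress(compressed + trailing)

# The decompressor stops at the logical end of the bz2 stream instead of
# choking on the extra input; the excess bytes are reported, not consumed.
assert out == payload
assert decomp.eof
assert decomp.unused_data == trailing
```

This is roughly what the bz2/zlib filters would do under fix #1: consume
input up to the stream end, flag EOF, and leave the excess alone.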
> Even in this situation, there is the potential that less data is
> available than was passed in. A clear example would be if we requested
> only 170 bytes. 144 of those bytes would be passed in as the complete
> compressed data, and bzip2.decompress would decompress all of it to 176
> bytes. 170 of those bytes would be returned from php_stream_read, and 6
> would have to be placed in a cache for future reads. Thus, there would
> need to be some way of marking the cache as valid because of this logic
> path:
>
> <?php
> $a = fopen('blah.zip', 'rb');
> fseek($a, 132); // fills read buffer with unfiltered data
> stream_filter_append($a, 'bzip2.decompress'); // clears read buffer cache
> $b = fread($a, 170); // fills read buffer cache with 6 bytes
> fseek($a, 3, SEEK_CUR); // this should seek within the filtered data
>                         // read buffer cache
> stream_filter_append($a, 'zlib.inflate');
> ?>
>
> The question is what should happen when we append the second filter,
> 'zlib.inflate', to filter the already-filtered data. If we clear the
> read buffer as we did in the first case, it will result in lost data.
> So, let's assume we preserve the read buffer. Then, if we perform:
>
> <?php
> $c = fread($a, 7);
> ?>
>
> and assume the remaining 3 bytes expand to 8 bytes, how should the read
> buffer cache be handled? Should the first 3 bytes still be the filtered
> bzip2-decompressed data, and the last 3 be replaced with the 8 bytes of
> decompressed zlib data?
>
> Basically, I am wondering if perhaps we need to implement a read buffer
> cache for each stream filter. This could solve our problem, I think.
> The data would be stored like so:
>
> stream: 170 bytes of unfiltered data, and a pointer to byte 145 as the
> next byte for php_stream_read()
> bzip2.decompress filter: 176 bytes of decompressed bzip2 data, and a
> pointer to byte 171 as the next byte for php_stream_read()
> zlib.inflate filter: 8 bytes of decompressed zlib data, and a pointer to
> byte 8 as the next byte for php_stream_read()
>
> This way, we would essentially have a stack of stream data. If the zlib
> filter were then removed, we could "back up" to the bzip2 filter and so
> on. This would allow proper read cache filling, and remove the weird
> ambiguities that are apparent in a filtered stream. I don't think we
> would need to worry about backwards compatibility here, as the most
> common use case would be unaffected by this change, and the use case it
> would fix has never actually worked.
>
> I haven't got a patch for this yet, but it would be easy to do if the
> logic is sound.

The problem is mainly to be able to filter a given number of bytes,
starting at a given position, and to know when that amount has been
passed to the filter.

I would propose a new argument to stream_filter_append:

stream_filter_append(stream, filter_name[, max_input_bytes])

so that only max_input_bytes will be passed to the filter. To know when
that amount has been passed to the filter, I would propose making the
stream act as a slice of the original stream: it returns EOF once
max_input_bytes have been passed to the filter. Removing the filter
clears the EOF flag and allows reading from the stream again.

Your proposal of a read buffer cache for each filter would help with
that, and would also make filters more robust in some use cases.

Regards,

Arnaud
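P.S. To make the "slice" semantics concrete, here is a rough sketch (in
Python, just to illustrate the idea; StreamSlice and the byte strings are
invented for the example) of a wrapper that reports EOF once
max_input_bytes have been consumed, while the underlying stream keeps its
position for later reads:

```python
import io

class StreamSlice:
    """A read-only view over at most max_input_bytes of an underlying
    stream. Once the budget is exhausted, the slice reports EOF even
    though the underlying stream may still have data; discarding the
    slice ("removing the filter") lets callers keep reading the
    underlying stream from where the window ended."""

    def __init__(self, stream, max_input_bytes):
        self._stream = stream
        self._remaining = max_input_bytes

    def read(self, n=-1):
        if self._remaining == 0:
            return b""                        # EOF for the slice only
        if n < 0 or n > self._remaining:
            n = self._remaining               # clamp to the remaining budget
        chunk = self._stream.read(n)
        self._remaining -= len(chunk)
        return chunk

raw = io.BytesIO(b"144-bytes-of-compressed-data|rest-of-zip")
window = StreamSlice(raw, 28)                 # only filter the first 28 bytes
assert window.read(100) == b"144-bytes-of-compressed-data"
assert window.read(10) == b""                 # the filter's input window is exhausted
assert raw.read() == b"|rest-of-zip"          # the rest of the stream is untouched
```

A filter attached to such a slice would naturally see EOF at the window
boundary, which is what lets the implementation know the filter has
received all of its input.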