From: greg@chiaraquartet.net (Gregory Beaver)
Date: Sat, 11 Oct 2008 12:45:09 -0500
To: internals Mailing List
Newsgroups: php.internals
Message-ID: <48F0E625.9050505@chiaraquartet.net>
Subject: question on how to solve major stream filter design flaw

Hi,

I'm grappling with a design flaw I just uncovered in stream filters, and need some advice on how best to fix it.
The problem has existed since the introduction of stream filters, and has 3 parts. 2 of them can probably be fixed safely in PHP 5.2+, but I think the third may require an internal redesign of stream filters, and so would probably have to be PHP 5.3+, even though it is a clear bugfix (Ilia, your opinion appreciated on this).

The first part of the bug that I encountered is best described here: http://bugs.php.net/bug.php?id=46026. However, the problem runs deeper than this, as attempting to cache data is dangerous any time a stream filter is attached to a stream. I should also note that the patch in this bug contains feature additions that would have to wait for PHP 5.3.

I ran into this problem because I was trying to use stream filters to read a bz2-compressed file within a zip archive in the phar extension. This was failing, and I first tracked the problem down to an attempt by php_stream_filter_append to read in a bunch of data and cache it, which passed more data into the bz2 decompress filter than it could handle, making it barf.

After fixing this problem, I ran into the problem described in the bug above, because php_stream_fill_read_buffer does the same thing when the data is read: I requested 176 decompressed bytes, and so php_stream_read passed 176 bytes into the decompress filter. Only 144 of those bytes were actually bz2-compressed data, and so the filter barfed upon trying to decompress the remaining data (same as bug #46026, found differently).

You can probably tell from my explanation that this is an extraordinarily complex problem. There are 3 interrelated problems here:

1) bz2 (and zlib) stream filters should stop trying to decompress when they reach the stream end, regardless of how many bytes they are told to decompress (easy to fix)
2) it is never safe to cache read data when a read stream filter is appended, as there is no safe way to determine in advance how much of the stream can be safely filtered.
   (would be easy to fix if it weren't for #3)
3) there is no clear way to request that a certain number of filtered bytes be returned from a stream, as opposed to how many unfiltered bytes should be passed into the stream. (very hard to fix without design change)

I need some advice on #3 from the original designers of stream filters and streams, as well as any experts who have dealt with this kind of problem in other contexts.

In this situation, should we expect stream filters to always stop filtering if they reach the end of valid input? Even then, there is potential that less data is available than was passed in. A clear example would be if we requested only 170 bytes. 144 of those bytes would be passed in as the complete compressed data, and bz2.decompress would decompress all of it to 176 bytes. 170 of those bytes would be returned from php_stream_read, and 6 would have to be placed in a cache for future reads. Thus, there would need to be some way of marking the cache as valid because of this logic path:

The question is what should happen when we append the second filter 'zlib.inflate' to filter the filtered data? If we clear the read buffer as we did in the first case, it will result in lost data. So, let's assume we preserve the read buffer. Then, if we perform:

and assume the remaining 3 bytes expand to 8 bytes, how should the read buffer cache be handled? Should the first 3 bytes still be the filtered bzip2 decompressed data, and the last 3 replaced with the 8 bytes of decompressed zlib data?

Basically, I am wondering if perhaps we need to implement a read buffer cache for each stream filter. This could solve our problem, I think.
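As an aside, the mismatch between bytes passed in and bytes wanted back can be reproduced outside PHP. The following is only a sketch using Python's bz2 module, not the PHP streams code: the payload and the trailing bytes are made up, and the compressed size will not be the 144 bytes mentioned above. It shows a decompressor that stops at the end of the compressed stream even when handed extra bytes (problem 1), and the 6 decompressed bytes left over when only 170 of 176 are requested (problem 3).

```python
import bz2

# Made-up data: 176 bytes of payload, as in the example above.
payload = bytes(range(176))
compressed = bz2.compress(payload)

# Bytes that happen to follow the compressed stream in the container
# (for instance the next entry of an archive). Purely illustrative.
trailer = b"NEXT-ZIP-ENTRY"

# The stream layer over-reads and hands the filter more than the bz2 stream.
chunk = compressed + trailer

decomp = bz2.BZ2Decompressor()
out = decomp.decompress(chunk)

# Problem 1: a well-behaved filter stops at end of stream and
# reports the input it did not consume instead of failing on it.
assert decomp.eof
assert decomp.unused_data == trailer

# Problem 3: the caller wanted 170 filtered bytes, the filter produced 176.
wanted = 170
returned, leftover = out[:wanted], out[wanted:]
print(len(returned), len(leftover))  # prints "170 6"
```

Those 6 leftover bytes, together with the unconsumed input, are exactly what a per-filter read buffer would have to hold.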
The data would be stored like so:

stream: 170 bytes of unfiltered data, and a pointer to byte 145 as the next byte for php_stream_read()
bzip2.decompress filter: 176 bytes of decompressed bzip2 data, and a pointer to byte 171 as the next byte for php_stream_read()
zlib.inflate filter: 8 bytes of decompressed zlib data, and a pointer to byte 8 as the next byte for php_stream_read()

This way, we would essentially have a stack of stream data. If the zlib filter were then removed, we could "back up" to the bzip2 filter and so on. This will allow proper read cache filling, and remove the weird ambiguities that are apparent in a filtered stream.

I don't think we would need to worry about backwards compatibility here, as the most common use case would be unaffected by this change, and the use case it would fix has never actually worked.

I haven't got a patch for this yet, but it would be easy to do if the logic is sound.

Thanks,
Greg
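
A rough illustration of the per-filter buffer stack described above, written in Python rather than the C of the streams layer. Every class and method name here (ReadBuffer, BufferStack, push, pop) is hypothetical and does not correspond to any existing PHP API. The sketch models only the bookkeeping, one buffer and one read position per layer, and not the filters pulling data from the layer beneath them. Positions are 0-based offsets, so "a pointer to byte 145" becomes pos=144.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReadBuffer:
    """One layer's cached bytes plus the offset of the next unread byte."""
    owner: str
    data: bytes = b""
    pos: int = 0  # 0-based offset of the next byte to hand out

    def read(self, n: int) -> bytes:
        chunk = self.data[self.pos:self.pos + n]
        self.pos += len(chunk)
        return chunk

    def remaining(self) -> int:
        return len(self.data) - self.pos


@dataclass
class BufferStack:
    """Index 0 is the raw stream; the last entry is the newest filter."""
    layers: List[ReadBuffer] = field(default_factory=list)

    def push(self, owner: str, data: bytes = b"", pos: int = 0) -> ReadBuffer:
        # Appending a filter adds a layer; nothing beneath it is cleared.
        layer = ReadBuffer(owner, data, pos)
        self.layers.append(layer)
        return layer

    def pop(self) -> ReadBuffer:
        # Removing a filter "backs up" to the buffer beneath it.
        return self.layers.pop()

    def read(self, n: int) -> bytes:
        # Reads are always served from the topmost layer.
        return self.layers[-1].read(n)


stack = BufferStack()
stack.push("stream", bytes(170), pos=144)            # 144 raw bytes already consumed
stack.push("bzip2.decompress", bytes(176), pos=170)  # 170 returned, 6 still cached
stack.push("zlib.inflate", b"8 bytes!")              # placeholder for inflated output

inflated = stack.read(8)
stack.pop()                                          # remove the zlib filter
leftover = stack.layers[-1].remaining()
print(stack.layers[-1].owner, leftover)              # prints "bzip2.decompress 6"
```

Because each layer keeps its own position, removing the top filter leaves the bzip2 layer's 6 cached bytes intact instead of forcing the read buffer to be cleared.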