Newsgroups: php.internals
Xref: news.php.net php.internals:22883
From: andrei@gravitonic.com (Andrei Zmievski)
To: Andi Gutmans
Cc: Dmitry Stogov, PHP Internals
Date: Wed, 19 Apr 2006 14:32:46 -0700
Subject: Re: Unicode conversion exceptions and memory leaks
Message-ID: <409317664cec60ee0d2086846f050abf@gravitonic.com>
In-Reply-To: <7.0.1.0.2.20060413160149.03eb5d00@zend.com>
References: <7.0.1.0.2.20060413154916.014b1d88@zend.com> <7.0.1.0.2.20060413160149.03eb5d00@zend.com>

I've had some time to think about this, and Derick and I also kicked around some ideas in a private conversation. The situation I am talking about is really about exceptional circumstances, such as an ISO-8859-1 string being treated as a UTF-8 one, or some other condition that results in illegal sequences.
This is very different from an unassigned character condition, which is handled by the SUBST, SKIP, etc. callbacks. I disagree with the notion that this is similar to the (int)"foo" example. There, we have well-defined semantics that say "strings not starting with a number get converted to 0". Treating ISO-8859-1 data as UTF-8 is simply invalid and bad behavior, and should not be encouraged by silently ignoring the conversion error.

Now, I understand that there is resistance to the use of exceptions in this case, and I see the point of those who are against them. My problem is this: if we do not throw exceptions, then all we are left with is a warning, which is not helpful if you want to determine programmatically whether there was a conversion error. Sure, you can check the return value of unicode_decode(), or maybe even fread() and such, but that does not help with casting, concatenation, and other similar operations. So we do need a mechanism for this, and it has to be a fairly flexible one, because libraries may want to do one thing on failure and the application itself another.

The best Derick and I could come up with is a user-specified conversion error handler. It would be invoked only when the converter encounters an illegal sequence or other serious error. The existing subst, skip, etc. error modes would still apply. The error handler signature would be something like:

    function my_handler($direction, $encoding, $string, $char_byte, $offset) { .. }

Where $direction is the direction of conversion (FROM_UNICODE or TO_UNICODE), $encoding is the name of the encoding in use during the attempted conversion, $string is the source string that the converter tried to process, $char_byte is either the failed Unicode character or the failed byte sequence (depending on direction), and $offset is the offset of that character/byte sequence in the source string.
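
To make the handler arguments concrete, here is a rough userland sketch of what a call might look like for the ISO-8859-1-as-UTF-8 case. It is only an illustration: check_utf8() is a hypothetical stand-in for the converter (not a PHP function), the constant values are made up, and a real hook would live in the engine rather than in a regex.

```php
<?php
// Hypothetical direction constants: the names come from the proposal,
// the values are invented for this sketch.
const TO_UNICODE = 1;
const FROM_UNICODE = 2;

// Handler with the proposed signature. This one only formats a report.
function my_handler($direction, $encoding, $string, $char_byte, $offset) {
    return sprintf('%s: illegal byte 0x%s at offset %d',
                   $encoding, bin2hex($char_byte), $offset);
}

// Userland stand-in for the converter (hypothetical helper, not a PHP API).
// Returns the input unchanged if it is well-formed UTF-8; otherwise calls
// $handler with the first offending byte and its offset, and returns
// whatever the handler returns.
function check_utf8($bytes, $handler) {
    // Matches the longest well-formed UTF-8 prefix (byte mode, no /u flag).
    $valid = '/^(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]'
           . '|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
           . '|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}'
           . '|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*+/';
    preg_match($valid, $bytes, $m);
    $offset = strlen($m[0]);          // length of the well-formed prefix
    if ($offset === strlen($bytes)) {
        return $bytes;                // nothing illegal found
    }
    return $handler(TO_UNICODE, 'UTF-8', $bytes, $bytes[$offset], $offset);
}

// "café" encoded as ISO-8859-1: the lone 0xE9 byte is illegal in UTF-8.
echo check_utf8("caf\xE9", 'my_handler'), "\n";
// prints "UTF-8: illegal byte 0xe9 at offset 3"
```

Note that the sketch stops at the first illegal byte and does not try to resume the conversion.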

The user error handler is then free to silence the warning, throw an exception (throw new UnicodeConversionException($message, $direction, $char_byte, $offset)), or do something else. I have not yet decided whether it's a good idea to allow the user handler to continue the conversion or not. I'd rather the conversion always stopped.

-Andrei

On Apr 13, 2006, at 4:02 PM, Andi Gutmans wrote:

> Yeah but we can't only tailor to the default. If you cast "abc" to an
> integer today PHP will do the conversion (e.g. 0). I think we should
> stick to that paradigm and provide users with validation methods if
> they want to strictly validate...