Unicode conversion exceptions and memory leaks

19 years ago by Andrei Zmievski

If you run this code in PHP 6 right now, you will get a nice memory
leak notice.

<?php

unicode_set_error_mode(FROM_UNICODE, U_CONV_ERROR_STOP |
    U_CONV_ERROR_EXCEPTION);

$u = "< \u3844 >";
try {
    $s = (binary)$u;
} catch (UnicodeConversionException $e) {
    // the exception is caught here, yet the leak below is still reported
}

?>

/homes/andrei/dev/php-src/Zend/zend_unicode.c(461) : Freeing
0x013FC4E4 (1 bytes), script=e.php

From what Dmitry tells me, it is impossible to fix this with the way
our exceptions are currently implemented. I trust him on this, but at
the same time I think we need to do something. The conversions will be
happening in a lot of places, and being able to catch an exception, if
one results, is an important part of the workflow. I think Dmitry
proposed that we could rewrite the exceptions implementation, but he
also said that it would be a lot of work. I think we should seriously
consider it, though.

Thoughts?

-Andrei

19 years ago by Andi Gutmans

As you know, exceptions were designed for user-land code and not for
internal functionality. I warned about this when more exceptions were
being integrated into the extensions, and I especially warned about
ideas of integrating them into language constructs such as type hints
and conversions.
I don't have a good solution for this right now, except for adding a
lot of bulk to C extensions and the core and making them overly
complicated. Anyway, I'll discuss with Dmitry and see if he has any
ideas I didn't think of.

Re: this specific case. Is it really a good idea for a type
conversion to throw an exception? People won't be expecting that from
reading the code. They'd most likely only expect methods to throw exceptions...

Andi

At 03:44 PM 4/13/2006, Andrei Zmievski wrote:

> If you run this code in PHP 6 right now, you will get a nice memory
> leak notice.
>
> <?php
>
> unicode_set_error_mode(FROM_UNICODE, U_CONV_ERROR_STOP |
> U_CONV_ERROR_EXCEPTION);
>
> $u = "< \u3844 >";
> try {
> $s = (binary)$u;
> } catch (UnicodeConversionException $e) {
> }
>
> ?>
>
> /homes/andrei/dev/php-src/Zend/zend_unicode.c(461) : Freeing
> 0x013FC4E4 (1 bytes), script=e.php
>
> From what Dmitry tells me, it is impossible to fix this with the
> way our exceptions are implemented currently. I trust him on this,
> but at the same time I think we need to do something. The
> conversions will be happening in a lot of place and being able to
> catch an exception, if one results, is an important part of the
> workflow. I think Dmitry proposed that we could re-write exceptions
> implementation, but also said that it would be a lot of work. I
> think we should seriously consider it, though.
>
> Thoughts?
>
> -Andrei

19 years ago by Andrei Zmievski

By default type conversions will just output a warning. You will get
exceptions if you set the U_CONV_ERROR_EXCEPTION flag.

-Andrei

19 years ago by Andi Gutmans

Yeah, but we can't tailor only to the default. If you cast "abc" to an
integer today, PHP will do the conversion (yielding 0). I think we should
stick to that paradigm and provide users with validation methods if
they want to strictly validate...
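
To make the casting paradigm concrete, here is a minimal sketch in current PHP. It uses only the core cast and is_numeric(); the variable names are illustrative, and is_numeric() merely stands in for whatever validation method a user might choose.

```php
<?php
// Casting never fails: a non-numeric string silently becomes 0.
$from_text   = (int)"abc";    // 0
$from_prefix = (int)"12abc";  // 12, only the leading digits are used

// Strict validation is a separate, explicit step the caller opts into.
$text_is_numeric   = is_numeric("abc");  // false
$number_is_numeric = is_numeric("12");   // true
```

The cast itself never signals a problem; the check is something the caller has to ask for.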

Andi

At 03:56 PM 4/13/2006, Andrei Zmievski wrote:

> By default type conversions will just output a warning. You will get
> exceptions if you set the U_CONV_ERROR_EXCEPTION flag.

19 years ago by Andrei Zmievski

I've had some time to think about this and Derick and I also kicked
around some ideas in a private conversation.

The situation I am talking about is really about exceptional
circumstances, such as an ISO-8859-1 string being treated as a UTF-8 one
or some other condition that results in illegal sequences. This is very
different from an unassigned character condition, which is handled by
the SUBST, SKIP, etc. callbacks. I disagree with the notion that this is
similar to the (int)"foo" example. There, we have well-defined semantics
that say "strings not starting with a number get converted to 0".
Treating ISO-8859-1 data as UTF-8 is simply invalid and bad behavior
and should not be encouraged by silently ignoring the conversion error.
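
As a rough illustration of what an illegal sequence looks like, the sketch below checks two byte strings with current PHP. It is not the PHP 6 converter: it assumes only the bundled PCRE functions, whose /u modifier rejects subjects that are not valid UTF-8, and the variable names are made up for the example.

```php
<?php
// "café" encoded as ISO-8859-1: the lone byte 0xE9 is illegal in UTF-8.
$latin1 = "caf\xE9";
// The same text encoded as UTF-8.
$utf8 = "caf\xC3\xA9";

// preg_match() with /u returns false when the subject is not valid
// UTF-8, and 1 when the empty pattern matches a valid subject.
$latin1_is_utf8 = preg_match('//u', $latin1) === 1;  // false
$utf8_is_utf8   = preg_match('//u', $utf8) === 1;    // true
```

Unlike (int)"foo", there is no sensible value to fall back to for the first string: the bytes simply do not form UTF-8 text.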

Now, I understand that there is resistance to the use of exceptions in
this case and I see the point of those who are against them. My problem
is this: if we do not throw exceptions, then all we are left with is a
warning, which is not helpful if you want to determine in a
programmatic fashion whether there was a conversion error. Sure, you
can check the return value of unicode_decode(), or maybe even fread()
and such, but that does not help with casting, concatenation, and other
similar operations. So, we do need a mechanism for this, and it has to
be a fairly flexible one because libraries may want to do one thing on
failure, and the application itself another.

The best Derick and I could come up with is a user-specified conversion
error handler. It would be invoked only when the converter encounters
an illegal sequence or other serious error. The existing subst, skip,
etc. error modes would still apply. The error handler signature would be
something like:

function my_handler($direction, $encoding, $string, $char_byte, $offset) { .. }

where $direction is the direction of conversion (FROM_UNICODE or
TO_UNICODE), $encoding is the name of the encoding in use during the
attempted conversion, $string is the source string that the converter
tried to process, $char_byte is either the failed Unicode character or
byte sequence (depending on direction), and $offset is the offset of that
character/byte sequence in the source string. The user error handler
is then free to silence the warning, throw an exception (throw new
UnicodeConversionException($message, $direction, $char_byte, $offset)),
or do something else. I have not yet decided whether it's a good idea to
allow the user handler to continue the conversion or not. I'd rather the
conversion always stopped.
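
A minimal sketch of how such a handler might be driven, written for current PHP. Everything prefixed demo_ or DEMO_, plus DemoConversionException, is hypothetical and stands in for the real converter and constants; only the five-argument handler shape comes from the proposal above.

```php
<?php
// Hypothetical stand-ins: none of these names exist in PHP itself.
const DEMO_TO_UNICODE = 1;

class DemoConversionException extends RuntimeException {}

// Matches the longest well-formed UTF-8 prefix, byte-wise (no /u modifier).
const DEMO_UTF8_PREFIX = '/^(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]'
    . '|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
    . '|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}'
    . '|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*/';

// Simulated converter: validates $bytes as UTF-8 and, on an illegal
// sequence, calls $handler with the five proposed arguments and stops.
function demo_to_unicode(string $bytes, callable $handler): ?string
{
    preg_match(DEMO_UTF8_PREFIX, $bytes, $m);
    $offset = strlen($m[0]);
    if ($offset === strlen($bytes)) {
        return $bytes; // the whole input is well-formed
    }
    $handler(DEMO_TO_UNICODE, 'UTF-8', $bytes, $bytes[$offset], $offset);
    return null; // the conversion always stops after the handler returns
}

// A library might only record the failure...
$log = [];
$library_handler = function ($direction, $encoding, $string, $char_byte, $offset) use (&$log) {
    $log[] = sprintf('illegal byte 0x%02X at offset %d', ord($char_byte), $offset);
};

// ...while an application might prefer an exception.
$app_handler = function ($direction, $encoding, $string, $char_byte, $offset) {
    throw new DemoConversionException("illegal sequence at offset $offset");
};

$latin1 = "caf\xE9 au lait"; // ISO-8859-1 bytes, not valid UTF-8

$result = demo_to_unicode($latin1, $library_handler);

$threw = false;
try {
    demo_to_unicode($latin1, $app_handler);
} catch (DemoConversionException $e) {
    $threw = true;
}
```

The same converter serves both callers: the policy lives entirely in the handler that is passed in.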

-Andrei

19 years ago by Markus Fischer

Andrei Zmievski wrote:

> The best Derick and I could come up with is a user-specified conversion
> error handler. It would be invoked only when the converter encounters an
> illegal sequence or other serious error. The existing subst, skip, etc
> error modes would still apply. The error handler signature would be
> something like:
>
> function my_handler($direction, $encoding, $string, $char_byte,
> $offset) { .. }
>
> Where $direction is the direction of conversion (FROM_UNICODE or
> TO_UNICODE), $encoding is the name of the encoding in use during the
> attempted conversion, $string is the source string that converter tried
> to process, $char_byte is either failed Unicode character or byte
> sequence (depending on direction), and $offset is the offset of that
> character/byte sequence in the source string. The user error handler
> then is free to silence the warning, throw an exception (throw
> UnicodeConversionException($message, $direction, $char_byte, $offset),
> or do something else. I have no yet decided whether it's a good idea to
> allow user handler to continue the conversion or not. I'd rather the
> conversion always stopped.

My problem with those handlers is always that my control flow suddenly
gets completely interrupted, i.e. right in the middle the handler is
called and I've no information which class, object, function, file,
source, line, etc. This is, at best, annoying.

That is why I love exceptions. I'm in control. Not PHP. Not someone
else. I decide whether I want to interrupt my control flow or not.

Now, I also understand the "but the average Joe doesn't know exceptions"
argument and also the other one from this thread, that it may just be
technically too complex/not worth it.

On the other side, I fail to see the real advantage of the handler,
especially when it's yet to be decided whether continuing will be
allowed or not. That's a "horror" (comes from a saying in German, don't
know if it makes sense here in English) when you develop in or with
frameworks which have strict control flows and semantics on how to
handle cases.

Maybe I exaggerate in this case, feel free to ignore it. I mean hell, I
would even love it when a failed require would throw an exception ...

thanks,

- Markus

19 years ago by Derick Rethans

Markus Fischer wrote:

> My problem with those handlers is always that my control flow suddenly
> gets completely interrupted, i.e. right in the middle the handler is
> called and I've no information which class, object, function, file,
> source, line, etc. This is, at best, annoying.
>
> That is why I love those exception. I'm in control. Not PHP. Not someone
> else. I decide whether I want to interrupt my control flow or not.

But you can still do that: just throw the exception in the handler
yourself, like Andrei mentioned.

> Now, I also understand the "but average joe doesn't know exception"
> argument and also the other one from this thread that it may be just
> technical too complex/not worth it.

But that was not the argument that Andrei was making at all here.
Besides the point you just raised, there is also the following issue,
which is best illustrated with an example:

http://files.derickrethans.nl/pseudoandrei

In the example with exceptions you see that the application actually
aborts large parts (the "// lots of other shit 2" after the error),
whereas in the error_handler case the application can just continue.
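
The contrast can be sketched in current PHP as follows. This is not the linked pseudo-code: demo_convert() and demo_run() are hypothetical, and a user warning raised with trigger_error() stands in for the converter's error report.

```php
<?php
// Hypothetical converter: reports an illegal sequence either by
// throwing or by raising a warning that a user handler can intercept.
function demo_convert(string $bytes, bool $throw): ?string
{
    if (preg_match('//u', $bytes) === 1) {
        return $bytes; // valid UTF-8, nothing to report
    }
    if ($throw) {
        throw new RuntimeException('illegal sequence');
    }
    trigger_error('illegal sequence', E_USER_WARNING);
    return null;
}

// Records which steps of the surrounding code actually ran.
function demo_run(bool $throw): array
{
    $steps = [];
    try {
        $steps[] = 'work 1';
        demo_convert("caf\xE9", $throw);
        $steps[] = 'work 2'; // skipped when the conversion throws
    } catch (RuntimeException $e) {
        $steps[] = 'caught';
    }
    return $steps;
}

// Returning true marks the warning as handled; execution then resumes
// right after trigger_error().
set_error_handler(function (int $errno, string $errstr): bool {
    return true;
}, E_USER_WARNING);

$with_exception = demo_run(true);   // ['work 1', 'caught']
$with_handler   = demo_run(false);  // ['work 1', 'work 2']

restore_error_handler();
```

With the exception, everything after the failed conversion inside the try block is abandoned; with the handler, the handler decides and the surrounding code carries on.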

> On the other side I fail to see the real advantage of the handler,
> especially when it's yet to be decided whether continuing will be
> allowed or not.

Imperative in Andrei's text was "to continue the conversion or not",
and I think it should be up to the handler whether the application
should continue. Illegal sequences are about the same as including a
file which has a parse error. With this handler you can handle this
situation, which is mostly a problem for normal string operations. If
your code needs to do something more special to handle possibly broken
input, there are plenty of functions to do so (like checking the return
value).

> That's a "horror" (comes from a saying in german, don't
> know if it makes sense here in english) when you develop in or with
> frameworks which have strict control flows and semantics on how to
> handle cases.
>
> Maybe I exaggerate in this case, feel free to ignore it. I mean hell, I
> would even love it when a failed require would throw an exception ...

But that would set a bad precedent.

Derick

--
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org

19 years ago by Markus Fischer

Derick Rethans wrote:

> http://files.derickrethans.nl/pseudoandrei

Ah great, I was just not getting "the point". Sorry for the spam.

- Markus

19 years ago by Andrei Zmievski

So, no particular opinions on this, aside from Markus's? I hoped this
proposal would mollify both camps.

-Andrei

19 years ago by Andi Gutmans

I'm wondering whether it's technically feasible that any places where
such a conversion could fail would be allowed to throw an exception
(i.e. internal functions, stream handlers, INI reader, etc...)

At 02:36 PM 4/24/2006, Andrei Zmievski wrote:

> So, no particular opinions on this, aside from Markus's? I hoped
> this proposal would mollify both camps..

19 years ago by Andrei Zmievski

Right. Throwing exceptions from an output handler or the INI reader may
not be optimal.

What did you think about the user defined error handler though?

-Andrei

19 years ago by Andrei Zmievski

I hope that my latest commits make both camps happy.

The handler signature is actually:

function my_handler($direction, $encoding, $char_or_byte, $offset,
$message) { .. }

-Andrei

19 years ago by Jochem Maas

Andi Gutmans wrote:

> As you know, exceptions were designed to be user-land and not for
> internal functionality. I warned about that more exceptions were being
> integrated into the extensions; and I especially warned of ideas of
> integrating them into language constructs such as type hints and
> conversions.
> I don't have a good solution for this right now, except for adding a lot
> of bulk to C extensions and the core and making them overly complicated.
> Anyway, I'll discuss with Dmitry and see if he has any ideas I didn't
> think of.
>
> re: this specific case. Is it really a good idea for a type conversion
> to throw an exception? People won't be expecting that from reading the
> code. They'd most likely only expect methods to throw exceptions...

I have really enjoyed using exceptions since their inclusion in PHP, and the
first thing I thought was "ow, that's going to make going 'unicode' a lot harder
than the transparent transition that was originally pitched to end users"; my
second thought was "didn't the big cheeses decide that the engine wouldn't
be throwing any exceptions?"

Without being disrespectful to the tons of great work going into making
PHP (6?) Unicode-native, I would suggest that from the point of view of the
'average' PHP coder Andi's concerns are spot on.

rgds,
Jochem.