[PATCH] Bug 43477 - Unicode error mode ignored

17 years ago by Stephen Bach — view source — reply

unread

Attached is a simple proposed patch that fixes Bug 43477. Basically, the code
that set the error mode of the ICU converter was giving it an instruction
(the context parameter) to only skip or substitute if the code point was not
represented in the new encoding. However, it still was returning an error for
illegal sequences.

The test suite returns the same results with or without the patch. Test also
attached.

-Stephen Bach

17 years ago by Andrei Zmievski — view source — reply

unread

Why would we not want to stop on illegal sequences?

-Andrei

Stephen Bach wrote:

Attached is a simple proposed patch that fixes Bug 43477. Basically, the code
that set the error mode of the ICU converter was giving it an instruction
(the context parameter) to only skip or substitute if the code point was not
represented in the new encoding. However, it still was returning an error for
illegal sequences.

The test suite returns the same results with or without the patch. Test also
attached.

-Stephen Bach

17 years ago by Geoffrey Sneddon — view source — reply

unread

Why would we not want to stop on illegal sequences?

Stuff relies on things not stopping. No web browser stops on an
illegal sequence: they all use some replacement character (U+FFFD
REPLACEMENT CHARACTER in most). Sure, idealists will say that stopping
on errors will get the errors fixed, but that manifestly isn't true.
There's a huge amount of ill-formed XML out there. Shipping something
that required strings to be valid is just asking to not be used. It
breaks too much.

We already claim to have error modes that don't stop on error. This is
broken.

--
Geoffrey Sneddon
<http://gsnedders.com/

17 years ago by Stephen Bach — view source — reply

unread

I'm just suggesting that other error modes should do what they claim to do.
Stopping on an illegal sequence is fine, unless the user had called a
function telling the converter to do something else.

U_CONV_ERROR_STOP: stops on illegal character (the default)
U_CONV_ERROR_ESCAPE_*: 5 different modes that escape the illegal sequence in
various ways

Shouldn't U_CONV_ERROR_SKIP and U_CONV_ERROR_SUBST work the same way?

-Stephen

On Tuesday 18 March 2008 01:43:13 pm Andrei wrote:

Why would we not want to stop on illegal sequences?

-Andrei

Stephen Bach wrote:

Attached is a simple proposed patch that fixes Bug 43477. Basically, the
code that set the error mode of the ICU converter was giving it an
instruction (the context parameter) to only skip or substitute if the
code point was not represented in the new encoding. However, it still was
returning an error for illegal sequences.

The test suite returns the same results with or without the patch. Test
also attached.

-Stephen Bach

17 years ago by Geoffrey Sneddon — view source — reply

unread

Shouldn't U_CONV_ERROR_SKIP and U_CONV_ERROR_SUBST work the same way?

I guess: U_CONV_ERROR_SKIP is just U_CONV_ERROR_SUBST with the
substitution string as nothing, though I expect slight speed gains
could be made by keeping them separate (due to no attempt to even add
anything after coming across an invalid sequence — though the speed
gains will be very slight).

--
Geoffrey Sneddon
<http://gsnedders.com/

17 years ago by Stephen Bach — view source — reply

unread

Sorry for the ambiguity. Allow me to clarify: I meant that U_CONV_ERROR_SKIP
and U_CONV_ERROR_SUBST should work the same as the other error modes.
Otherwise, what's the point of having them?

-Stephen

On Tuesday 18 March 2008 06:02:36 pm Geoffrey wrote:

Shouldn't U_CONV_ERROR_SKIP and U_CONV_ERROR_SUBST work the same way?

I guess: U_CONV_ERROR_SKIP is just U_CONV_ERROR_SUBST with the
substitution string as nothing, though I expect slight speed gains
could be made by keeping them separate (due to no attempt to even add
anything after coming across an invalid sequence — though the speed
gains will be very slight).

--
Geoffrey Sneddon
<http://gsnedders.com/

17 years ago by Andrei Zmievski — view source — reply

unread

Okay. I'm fine with the patch.

Stephen Bach wrote:

Sorry for the ambiguity. Allow me to clarify: I meant that U_CONV_ERROR_SKIP
and U_CONV_ERROR_SUBST should work the same as the other error modes.
Otherwise, what's the point of having them?

-Stephen

On Tuesday 18 March 2008 06:02:36 pm Geoffrey wrote:

Shouldn't U_CONV_ERROR_SKIP and U_CONV_ERROR_SUBST work the same way?
I guess: U_CONV_ERROR_SKIP is just U_CONV_ERROR_SUBST with the
substitution string as nothing, though I expect slight speed gains
could be made by keeping them separate (due to no attempt to even add
anything after coming across an invalid sequence — though the speed
gains will be very slight).

--
Geoffrey Sneddon
<http://gsnedders.com/

17 years ago by Antony Dovgal — view source — reply

unread

Attached is a simple proposed patch that fixes Bug 43477. Basically, the code
that set the error mode of the ICU converter was giving it an instruction
(the context parameter) to only skip or substitute if the code point was not
represented in the new encoding. However, it still was returning an error for
illegal sequences.

The test suite returns the same results with or without the patch. Test also
attached.

Patch committed, thanks.

--
Wbr,
Antony Dovgal

17 years ago by Geoffrey Sneddon — view source — reply

unread

Attached is a simple proposed patch that fixes Bug 43477.
Basically, the code
that set the error mode of the ICU converter was giving it an
instruction
(the context parameter) to only skip or substitute if the code
point was not
represented in the new encoding. However, it still was returning an
error for
illegal sequences.

The test suite returns the same results with or without the patch.
Test also
attached.

Patch committed, thanks.

Can we test U_CONV_ERROR_SUBST too? See attached patch. Also, the bug
should be closed.

--
Geoffrey Sneddon
<http://gsnedders.com/

17 years ago by Antony Dovgal — view source — reply

unread

Patch committed, thanks.

Can we test U_CONV_ERROR_SUBST too? See attached patch. Also, the bug
should be closed.

The patch breaks the test.

Can you guys decide on what should work and how, I'll commit the patch afterwards, ok?

--
Wbr,
Antony Dovgal

17 years ago by Stephen Bach — view source — reply

unread

The original patch works as it should. Substitution with user-defined
characters only works in the "FROM_UNICODE" case. When converting to Unicode,
U+FFFD is always substituted because it is the standard substitution
character. Even the ICU library does not allow this to be changed (with good
reason).

As for the test, I think the testing script would have to be
Unicode-compatible and the test script would have to be in some UTF encoding
to verify that U+FFFD was substituted. I don't know if this is the case yet.

On Friday 21 March 2008 12:06:34 pm you wrote:

Patch committed, thanks.

Can we test U_CONV_ERROR_SUBST too? See attached patch. Also, the bug
should be closed.

The patch breaks the test.

Can you guys decide on what should work and how, I'll commit the patch
afterwards, ok?

17 years ago by Andrei Zmievski — view source — reply

unread

Where is the patch?

Geoffrey Sneddon wrote:

Attached is a simple proposed patch that fixes Bug 43477. Basically,
the code
that set the error mode of the ICU converter was giving it an
instruction
(the context parameter) to only skip or substitute if the code point
was not
represented in the new encoding. However, it still was returning an
error for
illegal sequences.

The test suite returns the same results with or without the patch.
Test also
attached.

Patch committed, thanks.

Can we test U_CONV_ERROR_SUBST too? See attached patch. Also, the bug
should be closed.

--
Geoffrey Sneddon
<http://gsnedders.com/