Hi everyone,
We've discussed this a few times in the past and it's time to make a
final decision about its removal.
I think most people have agreed that this is the way forward but no
one has produced a patch. I have a student working on unicode
conversion for the Google Summer of Code and this would help make it
simpler.
If there are no serious objections I'll create a patch and get this
done as soon as possible.
Scott
> We've discussed this a few times in the past and it's time to make a
> final decision about its removal.
> I think most people have agreed that this is the way forward but no
> one has produced a patch. I have a student working on unicode
> conversion for the Google Summer of Code and this would help make it
> simpler.
unicode_semantics=on breaks backwards compatibility in scripts that have
implemented multiple character set support in current PHP setups.
If the setting is removed, instead of maintaining at least some
backwards compatibility and doing some additional work, you force massive
code rewrites in scripts that depend on working charset support and more
work for people who use the interpreter.
Every time somebody proposes removal of this setting, they claim that
the majority agreed on it when there is no agreement on anything. People
only defended their own positions and we had another flame war about
unicode_semantics.
--
Tomas
Tomas Kuliavas wrote:
> > We've discussed this a few times in the past and it's time to make a
> > final decision about its removal.
> > I think most people have agreed that this is the way forward but no
> > one has produced a patch. I have a student working on unicode
> > conversion for the Google Summer of Code and this would help make it
> > simpler.
> unicode_semantics=on breaks backwards compatibility in scripts that have
> implemented multiple character set support in current PHP setups.
> If setting is removed, instead of maintaining at least some bits of
> backwards compatibility and doing some additional work, you force massive
> code rewrites in scripts that depend on working charset support and more
> work for people, who use interpreter.
> Every time somebody proposes removal of this setting, they claim that
> majority agreed on it when there is no agreement on anything. People only
> defended own positions and we had other flame about unicode_semantics.
There has been agreement by the people that actually contribute towards
the development of PHP.
It certainly doesn't give backwards compatibility: you are able to turn
it off in php.ini, and that means developers will need to maintain two
versions, one for it off and the other for on.
My biggest concern is the two code bases that need to be maintained by the
PHP developers; you need to have two branches for handling unicode and
native strings.
To sum it up, unicode_semantics is in the exact same vein as
ze1_compatibility_mode, and that was a complete failure.
Before any developers decide they need to port things to PHP 6 we need
to just make it Unicode only.
Scott
> My biggest concern is the 2 code bases that need to be maintained by the
> PHP developers, you need to have two branches for handling unicode and
> native strings.
> To sum it up, unicode_semantics is in the exact same vein as
> ze1_compatibility_mode and it was a complete failure.
Totally agree!
> Before any developers decide they need to port things to PHP 6 we need to
> just make it Unicode only.
I have some internal applications that I am happy to try porting to PHP 6 to
see the outcome and list any issues. I was waiting for this switch to be
removed first, though. If I have time I might try to do it before then, but
I'm pretty snowed under currently.
Regards
Marco
> > We've discussed this a few times in the past and it's time to make a
> > final decision about its removal.
> > I think most people have agreed that this is the way forward but no
> > one has produced a patch. I have a student working on unicode
> > conversion for the Google Summer of Code and this would help make it
> > simpler.
> unicode_semantics=on breaks backwards compatibility in scripts that have
> implemented multiple character set support in current PHP setups.
Why don't you go ahead and make a list of those exact issues then? We
can then see how to fix those issues. That's much more useful than just
posting to the mailing list when you don't agree with something. From
what I've seen with my code base, the changes that I have to do are
minimal once some (internal) functions are fixed up.
regards,
Derick
--
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org
> > > We've discussed this a few times in the past and it's time to make a
> > > final decision about its removal.
> > > I think most people have agreed that this is the way forward but no
> > > one has produced a patch. I have a student working on unicode
> > > conversion for the Google Summer of Code and this would help make it
> > > simpler.
> > unicode_semantics=on breaks backwards compatibility in scripts that have
> > implemented multiple character set support in current PHP setups.
> Why don't you go ahead and make a list of those exact issues then? We
> can then see how to fix those issues. That's much more useful than just
> posting to the mailing list when you don't agree with something. From
> what I've seen with my code base, the changes that I have to do are
> minimal once some (internal) functions are fixed up.
If I remain silent, others will have arguments that "everybody agrees on
removal of unicode_semantics".
I write and maintain charset decoding and encoding functions.
unicode_semantics breaks every mapping table and other functions that
operate with binary 8bit strings.
In slides by Andrei Zmievski Unicode symbols are written with \u. Why are
they written with \x (hex) and \nnn (octal) escapes in current PHP6?
<?php
echo "\xC3\200";
I am not writing U+00C3 and U+0080, I am writing U+00C0 in UTF-8.
<?php
$string = "ą";
var_dump(preg_replace("/([\300-\337])([\200-\277])/e",
"'&#'.((ord('\1')-192)*64+(ord('\2')-128)).';'", $string));
for ($i=0;$i<strlen($string);$i++) {
$char = ord($string[$i]);
echo sprintf("=%02X",$char);
}
string(6) "&#261;" and '=C4=85' expected, if "ą" is written in UTF-8.
I can bypass it by adding one line to every script that operates with
binary strings, but where are the guarantees that you won't dump declare()
support just like you dump unicode_semantics? What happens to your new
Unicode-aware string functions if I lie about a string's charset to the PHP
interpreter? mb_strlen can't calculate the correct $string length even when I
set the correct charset in the mb_strlen() arguments. If the above code works
as I want in PHP6 with unicode_semantics=on, mb_strlen($string,'utf-8')
returns 2 and not 1.
--
Tomas
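For anyone trying Tomas's second snippet on a current interpreter: the /e modifier it relies on was removed in PHP 7, and a PHP 6 with unicode_semantics was never released, so the following is only a sketch of the byte-level result he says he expects. It assumes PHP 7 or later, where strings are plain byte strings. preg_replace_callback stands in for /e, and the UTF-8 bytes of "ą" are spelled out so the result does not depend on the encoding of the script file; both choices are assumptions of this sketch, not part of the original message.

```php
<?php
// "ą" is U+0105; its UTF-8 encoding is the two bytes C4 85.
$string = "\xC4\x85";

// Turn each two-byte UTF-8 sequence into a decimal numeric character
// reference. Without the /u flag the pattern matches individual bytes.
$entities = preg_replace_callback(
    '/([\xC0-\xDF])([\x80-\xBF])/',
    function (array $m): string {
        return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
    },
    $string
);

// Dump every byte of the original string as =XX.
$hex = '';
for ($i = 0; $i < strlen($string); $i++) {
    $hex .= sprintf('=%02X', ord($string[$i]));
}

var_dump($entities); // string(6) "&#261;"
echo $hex, "\n";     // =C4=85
```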
Tomas Kuliavas wrote:
> <snip>
> > > > We've discussed this a few times in the past and it's time to make a
> > > > final decision about its removal.
> > > > I think most people have agreed that this is the way forward but no
> > > > one has produced a patch. I have a student working on unicode
> > > > conversion for the Google Summer of Code and this would help make it
> > > > simpler.
> > > unicode_semantics=on breaks backwards compatibility in scripts that have
> > > implemented multiple character set support in current PHP setups.
> > Why don't you go ahead and make a list of those exact issues then? We
> > can then see how to fix those issues. That's much more useful than just
> > posting to the mailing list when you don't agree with something. From
> > what I've seen with my code base, the changes that I have to do are
> > minimal once some (internal) functions are fixed up.
> If I remain silent, others will have arguments that "everybody agrees on
> removal of unicode_semantics".
> I can bypass it by adding one line to every script that operates with
> binary strings, but where are warranties that you won't dump declare()
> support just like you dump unicode_semantics. What happens to your new
> Unicode aware string functions, if I lie about strings' charset to PHP
> interpreter? mb_strlen can't calculate correct $string length even when I
> set correct charset in mb_strlen() arguments. If above code works as I
> want in PHP6 unicode_semantics=on, mb_strlen($string,'utf-8') returns 2
> and not 1.
That sounds like just the sort of edge case that Derick is suggesting needs
logging for fixing up. unicode_semantics=on is just another bodge to make
it happen rather than a solution. I think I understand your description, and
to my eyes it looks like a unicode bug that needs addressing?
We have been maintaining two code bases for a long time now - PHP4 and PHP5.
Now that PHP4 is being shelved finally those of us who have had to maintain
compatibility with PHP4 can now move on and address the problems of PHP5/PHP6
compatibility. So from MY point of view unicode_semantics=on is creating a
THIRD case to have to manage? PLEASE can someone take charge and at least get
PHP6 moving forward to a stable alpha so that we have something users can be
happy to test against!
PHP5 = code sets
PHP6 = Unicode
--
Lester Caine - G8HFL
Contact - http://home.lsces.co.uk/lsces/wiki/?page=contact
L.S.Caine Electronic Services - http://home.lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php
Lester Caine wrote:
> That sounds like just the sort of edge case that Derick is suggesting
> needs logging for fixing up. unicode_semantics=on is just another bodge
> to make it happen rather than a solution. I think I understand your
> description, and to my eyes it looks like a unicode bug that needs
> addressing?
No, it's a misunderstanding of how things work that has been explained
to Tomas countless times. A unicode string consists of codepoints, not
of bytes. Having \xXX and \XXX insert bytes instead of codepoints does
not make sense, because a) That would require a defined unicode
encoding to be used, and even if that is the case b) would allow you to
insert broken data into the unicode string, so it's not a unicode string
anymore, which is a no-no. If you want to do that sort of fiddling with
binary details, use binary strings, not unicode strings.
Regards,
Stefan
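The bytes-versus-code-points distinction Stefan describes can be checked on a byte-string interpreter. A minimal sketch, assuming PHP 7.2 or later with the mbstring extension (mb_chr is not mentioned in the thread, and nothing here reproduces PHP 6 behaviour): the escapes from Tomas's example, read as bytes, form one UTF-8 character, while the code points U+00C3 and U+0080 encode to four bytes.

```php
<?php
// Read as bytes: C3 80 is the UTF-8 encoding of the single character U+00C0.
$asBytes = "\xC3\200";

// Read as code points: U+00C3 followed by U+0080, each encoded as UTF-8.
$asCodepoints = mb_chr(0xC3, 'UTF-8') . mb_chr(0x80, 'UTF-8');

// Columns: hex dump, length in bytes, length in UTF-8 characters.
echo bin2hex($asBytes), ' ', strlen($asBytes), ' ',
    mb_strlen($asBytes, 'UTF-8'), "\n";      // c380 2 1
echo bin2hex($asCodepoints), ' ', strlen($asCodepoints), ' ',
    mb_strlen($asCodepoints, 'UTF-8'), "\n"; // c383c280 4 2
```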
> Lester Caine wrote:
> > That sounds like just the sort of edge case that Derick is suggesting
> > needs logging for fixing up. unicode_semantics=on is just another bodge
> > to make it happen rather than a solution. I think I understand your
> > description, and to my eyes it looks like a unicode bug that needs
> > addressing?
> No, it's a misunderstanding of how things work that has been explained
> to Tomas countless times. A unicode string consists of codepoints, not
> of bytes. Having \xXX and \XXX insert bytes instead of codepoints does
> not make sense, because a) That would require a defined unicode
> encoding to be used, and even if that is the case b) would allow you to
> insert broken data into the unicode string, so it's not a unicode string
> anymore, which is a no-no. If you want to do that sort of fiddling with
> binary details, use binary strings, not unicode strings.
I agree that it is not a bug, because I declare an invalid encoding in
scripts in order to make sure that binary and unicode bytes are equal.
You haven't explained to me how things work. All your explanations ask me to
use code compatible only with PHP 5.2.1+, drop code that worked fine in
older PHP versions and give up control of charset conversions. I want
backwards compatibility with PHP 5.2.0 and PHP4. I want to be able to
control charset conversions. Where are the guarantees that charset conversions
will work better in PHP6? In current setups it is safer to do charset
conversions internally instead of relying on PHP to do things. And I can't
drop that code entirely because the Unicode implementation in PHP 5.2.1 is
a dummy. It is there only to avoid E_PARSE errors in PHP6 compatible code.
--
Tomas
Precisely.
Stefan Walk wrote:
> Lester Caine wrote:
> > That sounds like just the sort of edge case that Derick is suggesting
> > needs logging for fixing up. unicode_semantics=on is just another
> > bodge to make it happen rather than a solution. I think I
> > understand your description, and to my eyes it looks like a unicode
> > bug that needs addressing?
> No, it's a misunderstanding of how things work that has been explained
> to Tomas countless times. A unicode string consists of codepoints, not
> of bytes. Having \xXX and \XXX insert bytes instead of codepoints does
> not make sense, because a) That would require a defined unicode
> encoding to be used, and even if that is the case b) would allow you to
> insert broken data into the unicode string, so it's not a unicode string
> anymore, which is a no-no. If you want to do that sort of fiddling with
> binary details, use binary strings, not unicode strings.
> Regards,
> Stefan
> So from MY point of view unicode_semantics=on is creating a THIRD
> case to have to manage? PLEASE can someone take charge and at least
> get PHP6 moving forward to a stable alpha so that we have something
> users can be happy to test against!
I think the reason why people are reluctant to "take charge" here is
just because of this setting.
regards,
Derick
Derick Rethans wrote:
> > So from MY point of view unicode_semantics=on is creating a THIRD
> > case to have to manage? PLEASE can someone take charge and at least
> > get PHP6 moving forward to a stable alpha so that we have something
> > users can be happy to test against!
> I think the reason why people are reluctant to "take charge" here is
> just because of this setting.
And as a result nothing is happening :(
Do we need to set up some formal vote on this quite basic feature which was -
I thought - the whole basis that PHP6 was being built on?
Or do we have to wait another 5 years for PHP6 :(
Working with Unicode does require a different mindset, and THEN overloading it
by requiring complete compatibility with a non-unicode model is adding a level
of complexity that has resulted in the current stalemate?
I was ready to run with Unicode/PHP6 two years ago and run all the database
data Unicode as well, but at present things seem to be in limbo all around?
--
Lester Caine - G8HFL
Tomas Kuliavas wrote:
> If I remain silent, others will have arguments that "everybody agrees on
> removal of unicode_semantics".
> I write and maintain charset decoding and encoding functions.
> unicode_semantics breaks every mapping table and other functions that
> operate with binary 8bit strings.
Just curious, do these decoding/encoding functions do something that
Unicode support won't do?
> In slides by Andrei Zmievski Unicode symbols are written with \u. Why are
> they written with \x (hex) and \nnn (octal) escapes in current PHP6?
\x (hex) and \nnn (octal) escapes inside Unicode strings are assumed to
specify Unicode characters. This is one of the contention points, since a
few people have said that they should specify individual bytes rather than
characters, but in my opinion that's kind of dangerous since it may lead
to broken/invalid Unicode strings.
> <?php
> echo "\xC3\200";
> I am not writing U+00C3 and U+0080, I am writing U+00C0 in UTF-8.
This should work fine inside binary strings.
> I can bypass it by adding one line to every script that operates with
> binary strings, but where are warranties that you won't dump declare()
> support just like you dump unicode_semantics.
It won't get dumped. unicode_semantics is a BC/transition switch.
declare() is crucial to proper script parsing.
> What happens to your new
> Unicode aware string functions, if I lie about strings' charset to PHP
> interpreter?
You will get in trouble.
> mb_strlen can't calculate correct $string length even when I
> set correct charset in mb_strlen() arguments. If above code works as I
> want in PHP6 unicode_semantics=on, mb_strlen($string,'utf-8') returns 2
> and not 1.
I don't know what mbstring does or does not do with the unicode_semantics
switch, since the switch is meant to be deprecated.
-Andrei
As far as I remember, the latest point was to remove the
unicode_semantics switch and presume that its value is always On. At the
same time we said that binary strings should probably be the default
string type (which I don't agree with), and that we need to have a test
suite to see what exactly breaks with these changes.
-Andrei
> As far as I remember, the latest point was to remove the
> unicode_semantics switch and presume that its value is always On. At
> the same time we said that binary strings should probably be the
> default string type (which I don't agree with), and that we need to
> have a test suite to see what exactly breaks with these changes.
yeah .. that is what I remember as well ..
one decision done .. one more to go (whether the default string type will
be unicode or binary)
regards,
Lukas
Yep, we said that we'd remove the switch. Then we'd see how compatibility fares and if we discover the upgrade path is too painful we'd consider making "" be binary string and require u"" for Unicode strings. But this was TBD depending on people's experiences and our ability to deliver an easy migration path for applications.
So for now we should remove the switch. We can do this if needed.
Andi
-----Original Message-----
From: Andrei Zmievski [mailto:andrei@gravitonic.com]
Sent: Wednesday, May 07, 2008 9:36 AM
To: Derick Rethans
Cc: Tomas Kuliavas; internals@lists.php.net
Subject: Re: [PHP-DEV] Removal of unicode_semantics
> As far as I remember, the latest point was to remove the
> unicode_semantics switch and presume that its value is always On. At
> the same time we said that binary strings should probably be the default
> string type (which I don't agree with), and that we need to have a test
> suite to see what exactly breaks with these changes.
> -Andrei
> So for now we should remove the switch. We can do this if needed.
Who is "we" in this context? Zend?
Scott is already working on the removal but I'll bet he would really
appreciate help with it.
-Hannes
> Yep, we said that we'd remove the switch. Then we'd see how
> compatibility fares and if we discover the upgrade path is too painful
> we'd consider making "" be binary string and require u"" for Unicode
> strings. But this was TBD depending on people's experiences and our
> ability to deliver an easy migration path for applications. So for now
> we should remove the switch. We can do this if needed.
Scott is already working on this AFAIK. And like Andrei, I'd also be
against defaulting to binary strings.
regards,
Derick
See below:
-----Original Message-----
From: Derick Rethans [mailto:derick@php.net]
Sent: Thursday, May 08, 2008 12:23 AM
To: Andi Gutmans
Cc: Andrei Zmievski; PHP Developers Mailing List
Subject: RE: [PHP-DEV] Removal of unicode_semantics
> Scott is already working on this AFAIK. And like Andrei, I'd also be
> against defaulting to binary strings.
Great. Dmitry can help out if needed. He'll be reviewing it anyway.
I understand you are against it but as we discussed on this list a few months ago we will have to see what reality delivers when people actually start migrating applications. It's not something we should decide at this point before we are any smarter. For now we can definitely keep "" as Unicode and we'll learn how that works during the alpha/beta cycles.
We do owe our users a feasible upgrade path whether it's with automated scripts or some other way. As we figure that out it'll become more apparent what makes sense.
Andi
The easiest thing would be just to default unicode_semantics to On
internally and hide it from users. Don't remove all the UG(unicode)
checks yet, because we can test migration/compatibility with those in place.
-Andrei
On Sun, May 4, 2008 at 8:34 PM, Tomas Kuliavas
tokul@users.sourceforge.net wrote:
> > We've discussed this a few times in the past and it's time to make a
> > final decision about its removal.
> > I think most people have agreed that this is the way forward but no
> > one has produced a patch. I have a student working on unicode
> > conversion for the Google Summer of Code and this would help make it
> > simpler.
> unicode_semantics=on breaks backwards compatibility in scripts that have
> implemented multiple character set support in current PHP setups.
> If setting is removed, instead of maintaining at least some bits of
> backwards compatibility and doing some additional work, you force massive
> code rewrites in scripts that depend on working charset support and more
> work for people, who use interpreter.
> Every time somebody proposes removal of this setting, they claim that
> majority agreed on it when there is no agreement on anything. People only
> defended own positions and we had other flame about unicode_semantics.
It's the lesser of two evils.
If the switch stays there, every future author of libraries/frameworks
will have to maintain 2 separate code-bases (one for
unicode_semantics=off, the other for unicode_semantics=on).
On the other hand, 1 year from now it would be safe to require 5.2.1
as the minimum supported version of php, which will allow you to mark
all the strings as "binary", which will lead to an easier migration to
php-6.
--
Alexey Zakhlestin
http://blog.milkfarmsoft.com/
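The "binary" marking Alexey refers to is the b string prefix that PHP accepts from 5.2.1 onwards. A small sketch, assuming a released interpreter (5.2.1 or later), where the prefix is parsed but does not change the value; it only shows that such code runs today and yields an ordinary byte string, not how a Unicode-enabled PHP 6 would have treated it.

```php
<?php
// The b prefix marks a literal as binary. Released PHP versions accept it
// and otherwise ignore it, so both literals hold the same two bytes.
$marked = b"\xC3\x80";
$plain = "\xC3\x80";

var_dump($marked === $plain); // bool(true)
echo strlen($marked), "\n";   // 2
```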
Tomas Kuliavas wrote:
> > We've discussed this a few times in the past and it's time to make a
> > final decision about its removal.
> > I think most people have agreed that this is the way forward but no
> > one has produced a patch. I have a student working on unicode
> > conversion for the Google Summer of Code and this would help make it
> > simpler.
> unicode_semantics=on breaks backwards compatibility in scripts that have
> implemented multiple character set support in current PHP setups.
> If setting is removed, instead of maintaining at least some bits of
> backwards compatibility and doing some additional work, you force massive
> code rewrites in scripts that depend on working charset support and more
> work for people, who use interpreter.
> Every time somebody proposes removal of this setting, they claim that
> majority agreed on it when there is no agreement on anything. People only
> defended own positions and we had other flame about unicode_semantics.
And leaving unicode_semantics in will make it so web application
developers like myself, who distribute their applications to be
installed on people's own servers, have to write two different versions
of their software to support the switch being on or off because of the
major differences in the language based on an ini setting. Not only is
there twice the code in PHP's codebase, there's twice the code in the
codebases for people like me.
But, we've been through this discussion before. I've already stated my
opinions. +1 to removing this.
--
Jeremy Privett
C.E.O. & C.S.A.
Omega Vortex Corporation
> > We've discussed this a few times in the past and it's time to make a
> > final decision about its removal.
> > I think most people have agreed that this is the way forward but no
> > one has produced a patch. I have a student working on unicode
> > conversion for the Google Summer of Code and this would help make it
> > simpler.
> unicode_semantics=on breaks backwards compatibility in scripts that have
> implemented multiple character set support in current PHP setups.
> If setting is removed, instead of maintaining at least some bits of
> backwards compatibility and doing some additional work, you force massive
> code rewrites in scripts that depend on working charset support and more
> work for people, who use interpreter.
That is correct, removing The Switch does cause some backward compatibility breakage.
But The Switch does NOT fix it, that's the problem: you would still have
to fix your applications to work with unicode_semantics both OFF and ON,
i.e. it causes 2x more trouble.
> Every time somebody proposes removal of this setting, they claim that
> majority agreed on it when there is no agreement on anything.
The majority of active developers have agreed that the switch would cause more harm than good.
That's the fact.
--
Wbr,
Antony Dovgal
On 05.05.2008 at 09:51, Antony Dovgal wrote:
> > > We've discussed this a few times in the past and it's time to make a
> > > final decision about its removal.
> > > I think most people have agreed that this is the way forward but no
> > > one has produced a patch. I have a student working on unicode
> > > conversion for the Google Summer of Code and this would help make it
> > > simpler.
> > unicode_semantics=on breaks backwards compatibility in scripts that have
> > implemented multiple character set support in current PHP setups.
> > If setting is removed, instead of maintaining at least some bits of
> > backwards compatibility and doing some additional work, you force massive
> > code rewrites in scripts that depend on working charset support and more
> > work for people, who use interpreter.
> That is correct, removing The Switch does cause some backward
> compatibility breakage.
> But The Switch does NOT fix it, that's the problem: you would still
> have to fix your applications to work with unicode_semantics both
> OFF and ON, i.e. it causes 2x more trouble.
> > Every time somebody proposes removal of this setting, they claim that
> > majority agreed on it when there is no agreement on anything.
> The majority of active developers have agreed that the switch would
> cause more harm than good.
> That's the fact.
And that's the word. +10000000. Let's get rid of it and move on.
David
Just use Unicode and don't even think about backward compatibility, because
those who need it most are probably still on PHP4 and MySQL 3.x.
Most normal developers have been using utf-8 for years now and wouldn't even
notice it.
So +1 for pure Unicode. No switches. Lame hosting companies will 100% mess
up with this switch and will ruin everything again like it was with PHP5.
Make them pay for PHP5! ;) :D
Arvids Godjuks wrote:
> Most normal developers are for years with utf-8 for now and even wouldn't
> notice it.
Sorry to destroy your pipe dream but that's just not true.
- Chris
Well, at least in my country I haven't seen any normal programmer not using
unicode :)
2008/5/5 Christian Schneider cschneid@cschneid.com:
> Arvids Godjuks wrote:
> > Most normal developers are for years with utf-8 for now and even
> > wouldn't notice it.
> Sorry to destroy your pipe dream but that's just not true.
> - Chris
Arvids Godjuks wrote:
> Well, at least in my country i haven't saw any normal programmer not using
> unicode :)
<meta-posting>
I guess that was meant to be an ironic comment but I think we should
improve the signal-to-noise ratio on internals again.
</meta-posting>
- Chris
Hey Scott
As most others have already posted, from the PHP developers' point of
view it would be stupid to maintain two versions of the same function unless
you wrap it all into a function that does it by itself.
And yes zend.ze1_compatibility_mode was a major failure.
+1 for removal
Kalle
----- Original Message -----
From: "Scott MacVicar" scott@macvicar.net
To: "PHP Developers Mailing List" internals@lists.php.net
Sent: Sunday, May 04, 2008 6:12 PM
Subject: [PHP-DEV] Removal of unicode_semantics
> Hi everyone,
> We've discussed this a few times in the past and it's time to make a
> final decision about its removal.
> I think most people have agreed that this is the way forward but no one
> has produced a patch. I have a student working on unicode conversion for
> the Google Summer of Code and this would help make it simpler.
> If there are no serious objections I'll create a patch and get this done
> as soon as possible
> Scott
+1 for removal.
"Scott MacVicar" scott@macvicar.net wrote in message
news:4BD5A050-02F2-46BD-B867-FA8CA12FF1BD@macvicar.net...
> Hi everyone,
> We've discussed this a few times in the past and it's time to make a
> final decision about its removal.
> I think most people have agreed that this is the way forward but no one
> has produced a patch. I have a student working on unicode conversion for
> the Google Summer of Code and this would help make it simpler.
> If there are no serious objections I'll create a patch and get this done
> as soon as possible
> Scott