Hi Ilia,
During the verification of bug #26600, I just noticed that fgetcsv()
now behaves differently than in the previous release, 4.3.4.
After a quick examination, I found that the change you made
in revision 1.279.2.41 is related to this issue:
http://cvs.php.net/diff.php/php-src/ext/standard/file.c?r1=1.279.2.41&r2=1.279.2.42&ty=h
Before the patch, the test case attached below gave the
following result:
array(3) {
[0]=>
string(1) "a"
[1]=>
string(1) "b"
[2]=>
string(1) "c"
}
Since the patch, fgetcsv() returns the following:
array(3) {
[0]=>
string(1) "a"
[1]=>
string(2) " b"
[2]=>
string(2) " c"
}
While I think the new behaviour is more consistent with the CSV format
used by Microsoft Excel, it is also a BC problem.
Test case:
<?php
$file = '/tmp/test.csv';
$fp = fopen($file, 'w');
fwrite($fp, "a, b, c\n");
fclose($fp);
$fp = fopen($file, 'r');
var_dump(fgetcsv($fp, filesize($file)));
fclose($fp);
?>
What do you think of this?
Moriyoshi
What do you think of this?
I'll apply a fix momentarily; it wouldn't do to break BC in the stable branch.
That said, the whole space-trimming behavior seems a little unusual, since it
will corrupt content, especially if said content contains binary data. IMHO
the data read by fgetcsv() should be fetched in such a manner that the
original string can be recreated.
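Just to make the round-trip property I have in mind concrete (a sketch only,
not a proposed API, and it assumes unquoted fields):
<?php
// Desired invariant: joining the fields fgetcsv() returns should reproduce
// the raw line, spaces, binary bytes and all. Any trimming breaks this.
$raw = "a, b,\0c\n";
$fp = fopen('/tmp/roundtrip.csv', 'w');
fwrite($fp, $raw);
fclose($fp);
$fp = fopen('/tmp/roundtrip.csv', 'r');
$fields = fgetcsv($fp, 1024);
fclose($fp);
var_dump(implode(',', $fields) . "\n" === $raw);
?>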
Ilia
That said, the whole space-trimming behavior seems a little unusual, since it
will corrupt content, especially if said content contains binary data. IMHO
the data read by fgetcsv() should be fetched in such a manner that the
original string can be recreated.
I agree that it would be a good idea to provide a mechanism to do that,
but at this point I don't think we should be changing the behaviour of
fgetcsv() in either the stable branch or the HEAD branch. I'd add a new
binary-safe version of the function instead for this. Or an optional arg,
but fgetcsv() already has 2 optional args.
-Rasmus
That said, the whole space-trimming behavior seems a little unusual, since it
will corrupt content, especially if said content contains binary data. IMHO
the data read by fgetcsv() should be fetched in such a manner that the
original string can be recreated.
I agree that it would be a good idea to provide a mechanism to do that,
but at this point I don't think we should be changing the behaviour of
fgetcsv() in either the stable branch or the HEAD branch. I'd add a new
binary-safe version of the function instead for this. Or an optional arg,
but fgetcsv() already has 2 optional args.
My opinion is basically the same as Ilia's. And I think it'd also be
a good idea to introduce a few more options to modify the escaping
behaviour. Escape characters like \ are treated specially
at the moment, while the de facto specification from Microsoft
adopts a doubled-quotes style instead. IMO we should be
able to choose the behaviour.
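To make the difference concrete, here is a small sketch (illustration only;
which of the two lines parses cleanly depends on the escaping rules in effect):
<?php
// Two ways the same field, he said "hi", can be escaped inside enclosures.
$fp = fopen('/tmp/quotes.csv', 'w');
fwrite($fp, "a,\"he said \"\"hi\"\"\",c\n");  // doubled quotes (Excel style)
fwrite($fp, "a,\"he said \\\"hi\\\"\",c\n");  // backslash escaping
fclose($fp);
$fp = fopen('/tmp/quotes.csv', 'r');
var_dump(fgetcsv($fp, 1024));  // results depend on which convention the parser honours
var_dump(fgetcsv($fp, 1024));
fclose($fp);
?>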
Then, where do we go from here? Eventually we'll need to change the
spec of its arguments, or add a new function as Rasmus said...
Moriyoshi
I agree that it would be a good idea to provide a mechanism to do that,
but at this point I don't think we should be changing the behaviour of
fgetcsv() in either the stable branch or the HEAD branch. I'd add a new
binary-safe version of the function instead for this. Or an optional arg,
but fgetcsv() already has 2 optional args.
I think we could add another optional argument (a bitmask) that could be used to
control various capabilities of fgetcsv(). So, if another tunable behavior
is necessary, it could easily be added without breaking BC.
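Something along these lines, purely as an illustration (neither the constants
nor the fifth parameter exist anywhere yet):
<?php
// Hypothetical flags, named here only for the sake of the example:
define('CSV_TRIM_SPACES',    1 << 0); // strip spaces outside enclosures (old 4.3.x behaviour)
define('CSV_BINARY_SAFE',    1 << 1); // return fields so the original line can be rebuilt
define('CSV_DOUBLED_QUOTES', 1 << 2); // treat "" as an escaped quote instead of \"
$fp = fopen('/tmp/test.csv', 'r');
$row = fgetcsv($fp, 1024, ',', '"', CSV_BINARY_SAFE | CSV_DOUBLED_QUOTES);
fclose($fp);
?>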
On a related note, I should mention that fgetcsv() in 4.3.X is currently 2.5
times faster than its equivalent in 5.X.
Ilia
On a related note, I should mention that fgetcsv() in 4.3.X is currently 2.5
times faster than its equivalent in 5.X.
I don't know why you're mentioning this at this time, but I can say it is
a sort of necessary evil :) The HEAD version is capable of handling various
encodings, and is less intricate IMO. If anything, I was surprised by that
result; it's only 2.5 times slower :)
Moriyoshi
I don't know why you're mentioning this at this time, but I can say it is
a sort of necessary evil :) The HEAD version is capable of handling various
encodings, and is less intricate IMO. If anything, I was surprised by that
result; it's only 2.5 times slower :)
I'm mentioning this now because we are considering changes to the function in
the development branch, which is a fine time to resolve any deficiencies.
The added functionality, which if I understand correctly is support for
multibyte delimiters and enclosures, is great. But it hardly explains the
significant performance disparity I am seeing. I believe much of the problem
can be solved by moving from manual string iteration to one using C library
functions such as memchr(). When parsing non-multibyte text there shouldn't
be more than a 10-15% performance loss.
I should mention that the benchmarks were made using the time utility, so
advantages offered by PHP 5's speedups were discounted. Had they been
considered, the speed loss would've been 300% or more.
Ilia
I'm mentioning this now because we are considering changes to the function in
the development branch, which is a fine time to resolve any deficiencies.
Okay, fine :)
The added functionality, which if I understand correctly is support for
multibyte delimiters and enclosures, is great.
The change was not for multibyte delimiters and enclosures. The current
implementation still allows only single-byte characters for the delimiter
and enclosure. I was able to add such a capability as well, but I didn't,
because it appeared to slow things down considerably.
Several multibyte encodings like CP932, CP936, CP949, CP950
and Shift_JIS may map a value in the range 0x40 - 0xfe to the second byte,
which has been a problem. Therefore we need to check whether an octet at
a certain position belongs to a multibyte character or not, and this
fact motivated me to bring a scanner-like finite-state machine
implementation into fgetcsv() (and basename()).
See http://www.microsoft.com/globaldev/reference/WinCP.mspx for detail.
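As a concrete example (the byte values below assume CP932-encoded input):
<?php
// U+8868 ("hyou", as in hyouji = display) is encoded as 0x95 0x5C in CP932;
// its trailing byte is an ASCII backslash. In the line below, that 0x5C sits
// right before the closing quote, so a byte-oriented scanner that treats \"
// as an escaped quote mis-parses the field unless it tracks multibyte state.
$field = "\x95\x5C";                  // one CP932 character, two bytes
$line  = "x,\"" . $field . "\",y\n";  // x,"<0x95><0x5C>",y
?>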
But it hardly explains the significant performance disparity I am seeing.
I believe much of the problem can be solved by moving from manual string
iteration to one using C library functions such as memchr(). When parsing
non-multibyte text there shouldn't be more than a 10-15% performance loss.
I should mention that the benchmarks were made using the time utility, so
advantages offered by PHP 5's speedups were discounted. Had they been
considered, the speed loss would've been 300% or more.
If we limited the support to UTF-8 or EUC encoding only, we'd be
able to gain drastically better performance. But that wouldn't
actually solve practical problems in the environments where it is used.
Moriyoshi
How about we add mb_fgetcsv(), which would have full multi-byte support
(including delimiters)? I'd imagine that for people who need to parse multi-byte
CSV files, full functionality is more important than speed. As for the
fgetcsv() in ext/standard/, we can port the 4.3.X code (copy & paste really)
and let PHP 5 users benefit from a faster fgetcsv() for common applications.
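Roughly like this (mb_fgetcsv() is purely hypothetical at this point, of course):
<?php
// Hypothetical: a multibyte-aware variant living in ext/mbstring that takes
// the encoding explicitly. The signature is only meant to illustrate the split.
$fp = fopen('/tmp/data_sjis.csv', 'r');
$row = mb_fgetcsv($fp, 4096, ',', '"', 'CP932');
fclose($fp);
// Everyone else keeps the fast single-byte fgetcsv() in ext/standard:
$fp = fopen('/tmp/data.csv', 'r');
$row = fgetcsv($fp, 4096);
fclose($fp);
?>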
What do you think?
Ilia
How about we add mb_fgetcsv(), which would have full multi-byte support
(including delimiters)? I'd imagine that for people who need to parse multi-byte
CSV files, full functionality is more important than speed. As for the
fgetcsv() in ext/standard/, we can port the 4.3.X code (copy & paste really)
and let PHP 5 users benefit from a faster fgetcsv() for common applications.
What do you think?
I disagree, for the following reasons:
- Quite a few people actually do use fgetcsv() with multibyte
characters. Regarding this, applications made by those who don't use
such characters don't (and won't) use multibyte-specific
functions, and that's the problem. This greatly prevents
them from being portable.
- IMO speed is not a key factor here. People rather want
trustworthy behaviour.
- The fgetcsv() implementation in the stable branch is
now too complicated to add new features to,
and also hard to maintain. We should be able to
eliminate the mblen() calls for acceptable performance.
See the attached result.
Moriyoshi
p.s. fgetcsv() in the stable branch still seems to segfault with
the attached test case (segfault.php.txt).
[The benchmark result]
My code with mblen() (on php5-csv):
real 0m1.389s
user 0m1.330s
sys 0m0.060s
Ditto without mblen():
real 0m0.396s
user 0m0.350s
sys 0m0.040s
Your code (on php4-csv):
real 0m0.332s
user 0m0.270s
sys 0m0.060s
I disagree, for the following reasons:
- Quite a few people actually do use fgetcsv() with multibyte
characters. Regarding this, applications made by those who don't use
such characters don't (and won't) use multibyte-specific
functions, and that's the problem. This greatly prevents
them from being portable.
People have lived without multibyte support in fgetcsv() for many years now,
and I have not seen a single request on bugs.php.net for fgetcsv() multi-byte
support. So, while this is certainly useful functionality, I do not believe it
is as widely needed as you say it is. We also have a multibyte extension that
already implements multi-byte-safe variants of common functions; why make an
exception for fgetcsv() and add multibyte code into the core?
- IMO speed is not a key factor here. People rather want
trustworthy behaviour.
When it's a few percent and the changes offer significant improvements, yes,
but when we are talking about a performance loss of 250-300% or more, then
performance must become a consideration as well.
- The fgetcsv() implementation in the stable branch is
now too complicated to add new features to,
and also hard to maintain. We should be able to
eliminate the mblen() calls for acceptable performance.
See the attached result.
What features are we talking about here? The only two features I can see we may
wish to add are enclosures and separators longer than one character, and the
binary-safety thing. Both of these features would be fairly trivial to add.
p.s. fgetcsv() in the stable branch still seems to segfault with
the attached test case (segfault.php.txt).
Writing a fix now, thanks for the heads-up. If you have any more please let me
know.
I disagree, for the following reasons:
- Quite a few people actually do use fgetcsv() with multibyte
characters. Regarding this, applications made by those who don't use
such characters don't (and won't) use multibyte-specific
functions, and that's the problem. This greatly prevents
them from being portable.
People have lived without multibyte support in fgetcsv() for many years now,
and I have not seen a single request on bugs.php.net for fgetcsv() multi-byte
support. So, while this is certainly useful functionality, I do not believe it
is as widely needed as you say it is. We also have a multibyte extension that
already implements multi-byte-safe variants of common functions; why make an
exception for fgetcsv() and add multibyte code into the core?
I admit that few requests are in sight, but it's the case that
those who really want that support don't advocate it in English.
And I don't think fgetcsv() would be an exception: htmlentities() can
be cited as an example of a function that lives in the core and
supports multibyte strings. As I mentioned, pushing that kind of
functionality off into the mbstring extension doesn't solve the problem
in practice by any means.
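For instance (the optional charset argument has been around since PHP 4.1,
if I remember correctly):
<?php
// A core function that already accepts an encoding for multibyte text:
$s = "日本語 & \"quotes\"";                // UTF-8 encoded source
echo htmlentities($s, ENT_QUOTES, 'UTF-8');
?>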
- IMO speed is not a key factor here. People rather want
trustworthy behaviour.
When it's a few percent and the changes offer significant improvements, yes,
but when we are talking about a performance loss of 250-300% or more, then
performance must become a consideration as well.
If there is virtually no way to improve it, it seems natural to me that
we dismiss the issue.
- The fgetcsv() implementation in the stable branch is
now too complicated to add new features to,
and also hard to maintain. We should be able to
eliminate the mblen() calls for acceptable performance.
See the attached result.
What features are we talking about here? The only two features I can see we may
wish to add are enclosures and separators longer than one character, and the
binary-safety thing. Both of these features would be fairly trivial to add.
One thing I'm talking about here is escaping behaviour, which I
mentioned in the previous mail.
Moriyoshi
And I don't think fgetcsv() would be an exception: htmlentities() can
be cited as an example of a function that lives in the core and
supports multibyte strings. As I mentioned, pushing that kind of
functionality off into the mbstring extension doesn't solve the problem
in practice by any means.
htmlentities() is a rather special function; it handles not only multibyte but
a whole lot of different & unusual things. I do not think you can fairly
compare it to fgetcsv(). We have a multibyte extension for people who need
that functionality; why force it on everyone else?
- IMO speed is not a key factor here. People rather want
trustworthy behaviour.
When it's a few percent and the changes offer significant improvements, yes,
but when we are talking about a performance loss of 250-300% or more, then
performance must become a consideration as well.
If there is virtually no way to improve it, it seems natural to me that
we dismiss the issue.
Why does the vast majority of users have to endure a degradation in performance
for functionality that is needed by a few? It's as simple as that. The same
argument applies to basename().
One thing I'm talking about here is escaping behaviour, which I
mentioned in the previous mail.
I believe it would be possible to implement in the 4.3.X code; however, it
sounds specific to the multibyte implementation.
Ilia
And I don't think fgetcsv() would be an exception: htmlentities() can
be cited as an example of a function that lives in the core and
supports multibyte strings. As I mentioned, pushing that kind of
functionality off into the mbstring extension doesn't solve the problem
in practice by any means.
htmlentities() is a rather special function; it handles not only multibyte but
a whole lot of different & unusual things. I do not think you can fairly
compare it to fgetcsv().
What are you referring to as "a whole lot of different & unusual things"?
We have a multibyte extension for people who need
that functionality; why force it on everyone else?
Because it's a bug. The multibyte extension we have is not provided
to make the lives of those who don't use multibyte encodings easier.
It exists as an extension only because we had to do it that way in the past.
- IMO speed is not a key factor here. People rather want
trustworthy behaviour.
When it's a few percent and the changes offer significant improvements, yes,
but when we are talking about a performance loss of 250-300% or more, then
performance must become a consideration as well.
If there is virtually no way to improve it, it seems natural to me that
we dismiss the issue.
Why does the vast majority of users have to endure a degradation in performance
for functionality that is needed by a few? It's as simple as that. The same
argument applies to basename().
You must be underestimating the number of people who actually
need it.
One thing I'm talking about here is escaping behaviour, which I
mentioned in the previous mail.
I believe it would be possible to implement in the 4.3.X code; however, it
sounds specific to the multibyte implementation.
Escaping behaviour is totally irrelevant to the multibyte issue.
I think users should be able to choose, via an optional argument,
whether \" has to be treated as an escaped quote or as a simple
sequence of a backslash and a quote.
Moriyoshi
Why does the vast majority of users have to endure a degradation in performance
for functionality that is needed by a few? It's as simple as that. The same
argument applies to basename().
Ilia, we need to try to avoid this sort of thinking. This "vast majority"
is most likely only a "vocal majority" these days. It is very likely that
the non-mb users are actually the "few" and if we continue along your way
of thinking then we need to have an ext/singlebyte that implements all
these weird singlebyte string manipulation functions.
We need to move towards a uniform platform that works for everyone without
putting undue strain on either side.
-Rasmus
Ilia, we need to try to avoid this sort of thinking. This "vast majority"
is most likely only a "vocal majority" these days. It is very likely that
the non-mb users are actually the "few" and if we continue along your way
of thinking then we need to have an ext/singlebyte that implements all
these weird singlebyte string manipulation functions.
There is a good chance you are correct. However, my assumption is not without
basis; please consider the following statistic:
Google finds 185,000 (or so) phpinfo() pages; when mbstring is added to the
search query only 8150 pages are found. That leads me to believe that 1/2% of
that userbase uses mbstring. Even if we were to say that all of the people
who have mbstring compiled actually use it (which is highly unlikely), it's
still only 1/2%.
Ilia
There is a good chance you are correct. However, my assumption is not without
basis; please consider the following statistic:
Google finds 185,000 (or so) phpinfo() pages; when mbstring is added to the
search query only 8150 pages are found. That leads me to believe that 1/2% of
that userbase uses mbstring. Even if we were to say that all of the people
who have mbstring compiled actually use it (which is highly unlikely), it's
still only 1/2%.
As a sidenote, these statistics appear to be quite unrealistic.
phpinfo() => 186,000 (pages) [1]
phpinfo() mbstring => 8,330
phpinfo() Server API Configure Command => 16,800
phpinfo() Server API Configure Command mbstring => 4,510
[1] includes pages that are not actual phpinfo() output and
merely contain the keyword "phpinfo".
Moriyoshi
As a sidenote, these statistics appear to be quite unrealistic.
phpinfo() => 186,000 (pages) [1]
phpinfo() mbstring => 8,330
phpinfo() Server API Configure Command => 16,800
phpinfo() Server API Configure Command mbstring => 4,510
Even so, this still represents a MINORITY.
Ilia
As a sidenote, these statistics appear to be quite unrealistic.
phpinfo() => 186,000 (pages) [1]
phpinfo() mbstring => 8,330
phpinfo() Server API Configure Command => 16,800
phpinfo() Server API Configure Command mbstring => 4,510
Even so, this still represents a MINORITY.
Could a quarter be a minority? And the fact is, there are lots of
people who don't know what the problem is when they have to
use multibyte strings. They don't realise why they should not
use UTF-8 with their MySQL server set to handle ISO-8859-1,
for instance.
Moriyoshi
Could a quarter be a minority?
Unless the rules of mathematics have changed, 25% is still a minority. You also
forget that there are plenty of people who compile extensions and never end
up using them.
The critical point of this entire discussion is about NOT forcing choices on
people who do not want/need them. There is no good reason to force a multibyte
version of fgetcsv() on every single user, when there are not one but two PHP
extensions designed explicitly for multibyte support. If fgetcsv() in PHP 5
cannot be designed in such a way as to have no significant performance
penalties for non-multibyte strings, the function should be introduced as
mb_fgetcsv() or iconv_fgetcsv().
Ilia
Could a quarter be a minority?
Unless the rules of mathematics have changed, 25% is still a minority. You also
forget that there are plenty of people who compile extensions and never end
up using them.
A similar fact applies to your assumption. It's also true that PHP can
handle multibyte strings without mbstring or iconv in some cases where
the users are just fortunate enough not to get into trouble, most likely
because they just don't use the multibyte characters that are known
to cause problems due to their structure. Those user questions [1][2]
describe exactly when it goes wrong.
[1] http://marc.theaimsgroup.com/?l=php-dev&m=103828989330521&w=2
[2] http://news.php.net/article.php?group=php.i18n&article=633
The critical point of this entire discussion is about NOT forcing choices on
people who do not want/need them. There is no good reason to force a multibyte
version of fgetcsv() on every single user, when there are not one but two PHP
extensions designed explicitly for multibyte support.
On the other hand, the choices are very limited even for users familiar
with multibyte. First of all, the flexibility of the configuration has been
causing lots of confusion. I'd be happy if every existing application
used mb_*() instead of its counterpart in the appropriate places, but
it's unlikely. This is because we have two versions of the string manipulation
functions. And again, it has prevented users from writing multibyte-safe
applications, because the multibyte-flavoured extensions are currently not
enabled by default, though that fact is not my point.
If fgetcsv() in PHP 5 cannot be designed in such a way as to have no
significant performance penalties for non-multibyte strings, the function
should be introduced as mb_fgetcsv() or iconv_fgetcsv().
So, why not begin thinking of how it could be bearably fast
even with multibyte support enabled? While I think the current stuff
I made is the most portable and fastest code possible, it's probable
that there'd be far better code.
Moriyoshi
The critical point of this entire discussion is about NOT forcing choices on
people who do not want/need them. There is no good reason to force a multibyte
version of fgetcsv() on every single user, when there are not one but two PHP
extensions designed explicitly for multibyte support.
On the other hand, the choices are very limited even for users familiar
with multibyte. First of all, the flexibility of the configuration has been
causing lots of confusion. I'd be happy if every existing application
used mb_*() instead of its counterpart in the appropriate places, but
it's unlikely. This is because we have two versions of the string manipulation
functions. And again, it has prevented users from writing multibyte-safe
applications, because the multibyte-flavoured extensions are currently not
enabled by default, though that fact is not my point.
Percentages aside, you cannot deny the fact that not every application needs
multibyte support (whether this is a majority or 50/50 does not matter).
If a user needs multibyte, they may need to do a little searching to
find a provider that supports it; fortunately for them, PHP is a very common
scripting language with many hosting providers.
If fgetcsv() in PHP 5 cannot be designed in such a way as to have no
significant performance penalties for non-multibyte strings, the function
should be introduced as mb_fgetcsv() or iconv_fgetcsv().
So, why not begin thinking of how it could be bearably fast
even with multibyte support enabled? While I think the current stuff
I made is the most portable and fastest code possible, it's probable
that there'd be far better code.
If your code is indeed as fast as it can be, then the only alternative, it
would seem, is to separate the function into 'normal' and 'multibyte' variants,
allowing the user and not the developer to choose the one most suited to
their needs.
Ilia
Percentages aside, you cannot deny the fact that not every application needs
multibyte support (whether this is a majority or 50/50 does not matter).
If a user needs multibyte, they may need to do a little searching to
find a provider that supports it; fortunately for them, PHP is a very common
scripting language with many hosting providers.
I haven't denied it. That said, a multibyte facility is not as fancy
as XML, but it is quite essential to enable most applications to work
well in every environment.
So, why not begin thinking of how it could be bearably fast
even with multibyte support enabled? While I think the current stuff
I made is the most portable and fastest code possible, it's probable
that there'd be far better code.
If your code is indeed as fast as it can be, then the only alternative, it
would seem, is to separate the function into 'normal' and 'multibyte' variants,
allowing the user and not the developer to choose the one most suited to
their needs.
Let's stop doing such a stupid thing. As I pointed out already,
having different versions of each function doesn't solve the problem at all.
Moriyoshi
I haven't denied it. That said, a multibyte facility is not as fancy
as XML, but it is quite essential to enable most applications to work
well in every environment.
Bullshit. Only applications that need to support multibyte strings need the
multibyte facility.
Let's stop doing such a stupid thing. As I pointed out already,
having different versions of each function doesn't solve the problem at all.
It sure does: those who need the slower (multibyte) version use that, and those
who don't use the standard version, which works nicely and fast for
non-multibyte strings.
Ilia
I haven't denied it. That said, a multibyte facility is not as fancy
as XML, but it is quite essential to enable most applications to work
well in every environment.
Bullshit. Only applications that need to support multibyte strings need the
multibyte facility.
Let's stop doing such a stupid thing. As I pointed out already,
having different versions of each function doesn't solve the problem at all.
It sure does: those who need the slower (multibyte) version use that, and those
who don't use the standard version, which works nicely and fast for
non-multibyte strings.
So you think the right solution is to dismiss multibyte users and direct
them to the hacks (mbstring etc) that have been used previously instead
of thinking ahead?
If I were starting a language from scratch today, I would make character
encoding part of the string "zval" structure. IMHO that's where it
belongs. As an alternative for PHP 5[.1], there is room for a
"multibyte bit" in the zval that various functions can use to choose
between "sizeof(byte)==sizeof(char)" and "sizeof(byte) < sizeof(char)"
implementations.
- Stig
So you think the right solution is to dismiss multibyte users and direct
them to the hacks (mbstring etc) that have been used previously instead
of thinking ahead?
IMHO calling multibyte a hack would be a great disservice to the developers of
that extension. We don't call ext/pgsql a hack simply because it's not
built in, do we?
If I were starting a language from scratch today, I would make character
encoding part of the string "zval" structure. IMHO that's where it
belongs. As an alternative for PHP 5[.1], there is room for a
"multibyte bit" in the zval that various functions can use to choose
between "sizeof(byte)==sizeof(char)" and "sizeof(byte) < sizeof(char)"
implementations.
If you were designing a new language you wouldn't have legacy users who'd
suffer (significantly) because of features added for other users.
Ilia
So you think the right solution is to dismiss multibyte users and direct
them to the hacks (mbstring etc) that have been used previously instead
of thinking ahead?
IMHO calling multibyte a hack would be a great disservice to the developers of
that extension. We don't call ext/pgsql a hack simply because it's not
built in, do we?
The extension is virtually a hack. Again, the developers of mbstring
had to choose the option of adding support for multiple encodings to PHP
as a separate extension, instead of integrating it into the
core implementation, because we always have to preserve backwards
compatibility.
If I were starting a language from scratch today, I would make character
encoding part of the string "zval" structure. IMHO that's where it
belongs. As an alternative for PHP 5[.1], there is room for a
"multibyte bit" in the zval that various functions can use to choose
between "sizeof(byte)==sizeof(char)" and "sizeof(byte) < sizeof(char)"
implementations.
If you were designing a new language you wouldn't have legacy users who'd
suffer (significantly) because of features added for other users.
Well, the legacy users of PHP4 will suffer significantly from
PHP5's new features.
Moriyoshi
If you were designing a new language you wouldn't have legacy users
who'd suffer (significantly) because of features added for other users.
Well, the legacy users of PHP4 will suffer significantly from
PHP5's new features.
Uh? Where does this wisdom come from?
Derick
If you were designing a new language you wouldn't have legacy users
who'd suffer (significantly) because of features added for other users.
Well, the legacy users of PHP4 will suffer significantly from
PHP5's new features.
Uh? Where does this wisdom come from?
README.PHP4-TO-PHP5-THIN-CHANGES and various things I've spotted in
the current code.
Moriyoshi
If you were designing a new language you wouldn't have legacy users
who'd suffer (significantly) because of features added for other users.
Well, the legacy users of PHP4 will suffer significantly from
PHP5's new features.
Uh? Where does this wisdom come from?
README.PHP4-TO-PHP5-THIN-CHANGES and various things I've spotted in
the current code.
I know there are changes; I was inquiring about the "significantly" in
your statement.
Derick
Well, the legacy users of PHP4 will suffer significantly from
PHP5's new features.
How so? PHP 5 does break BC (especially for objects), but this is something
that was talked about for years, and the consensus is/was that the change is
for the better. To my knowledge, the majority of the new features in PHP5 are
just that and have no side effects.
Another alternative to moving fgetcsv()'s current implementation would be to
add #ifdef HAVE_MBSTRING around php_mblen, which would do
#define php_mblen(ptr, len) 1
if mbstring is not enabled.
Personally, I'd prefer to have the function in mbstring and modified further
to support multibyte delimiters and enclosures, which it does not do now due
to performance considerations.
Ilia
Stig S. Bakken wrote:
I haven't denied it. That said, a multibyte facility is not as fancy
as XML, but it is quite essential to enable most applications to work
well in every environment.
Bullshit. Only applications that need to support multibyte strings need the
multibyte facility.
Let's stop doing such a stupid thing. As I pointed out already,
having different versions of each function doesn't solve the problem at all.
It sure does: those who need the slower (multibyte) version use that, and those
who don't use the standard version, which works nicely and fast for
non-multibyte strings.
So you think the right solution is to dismiss multibyte users and direct
them to the hacks (mbstring etc) that have been used previously instead
of thinking ahead?
I see no example of him implying he wanted to "dismiss" multibyte users;
he simply suggested mb_* versions of the string manipulation functions
and pointed to available facilities that people can already use. I support
that idea, as having an mb_ version and a version without multibyte
support gives everyone what they want. People who want multibyte strings
have it, and people who want speed without multibyte strings still have
that; everyone should be happy. Those who don't need multibyte strings
(the majority, by a long shot) don't have to suffer any performance
loss, while those in Asia can open up that marketshare you speak of.
If I were starting a language from scratch today, I would make character
encoding part of the string "zval" structure. IMHO that's where it
belongs. As an alternative for PHP 5[.1], there is room for a
"multibyte bit" in the zval that various functions can use to choose
between "sizeof(byte)==sizeof(char)" and "sizeof(byte) < sizeof(char)"
implementations.
- Stig
I see no example of him implying he wanted to "dismiss" multibyte users;
he simply suggested mb_* versions of the string manipulation functions
and pointed to available facilities that people can already use. I support
that idea, as having an mb_ version and a version without multibyte
support gives everyone what they want. People who want multibyte strings
have it, and people who want speed without multibyte strings still have
that; everyone should be happy. Those who don't need multibyte strings
(the majority, by a long shot) don't have to suffer any performance
loss, while those in Asia can open up that marketshare you speak of.
It is a dismissal in the sense that existing apps not written explicitly
for multibyte support will not work for nearly half the users of PHP. We
are not talking about a small group of users here.
As Stig says, the correct solution would be to always store the encoding
of the string right alongside the length of the string in the guts of PHP.
Anything short of that is going to be a hack. PHP6 here we come...
-Rasmus
As Stig says, the correct solution would be to always store the encoding
of the string right alongside the length of the string in the guts of PHP.
Anything short of that is going to be a hack. PHP6 here we come...
Then here is our first TODO for PHP6 :). But until then, please, let's try to
avoid adding hacks that implement partial mb support in select functions.
These hacks serve no one: many users lose scalability, and other users get
partial support that will likely prevent them from using the new
functionality.
Ilia
We need to move towards a uniform platform that works for everyone without
putting undue strain on either side.
Sure we do, but not at a 200-250% performance loss.
Derick
Quoting Ilia Alshanetsky ilia@prohost.org:
And I don't think fgetcsv() would be an exception: htmlentities() can
be cited as an example of a function that lives in the core and
supports multibyte strings. As I mentioned, pushing that kind of
functionality off into the mbstring extension doesn't solve the problem
in practice by any means.
htmlentities() is a rather special function; it handles not only multibyte but
a whole lot of different & unusual things. I do not think you can fairly
compare it to fgetcsv(). We have a multibyte extension for people who need
that functionality; why force it on everyone else?
- IMO speed is not a key factor here. People rather want
trustworthy behaviour.
When it's a few percent and the changes offer significant improvements, yes,
but when we are talking about a performance loss of 250-300% or more, then
performance must become a consideration as well.
If there is virtually no way to improve it, it seems natural to me that
we dismiss the issue.
Why does the vast majority of users have to endure a degradation in performance
for functionality that is needed by a few? It's as simple as that. The same
argument applies to basename().
Just a general note on this discussion becoming sort of a "meta" topic:
from a PHP developer's POV, complete charset support should be a key
technology for the ZE and the standard extensions, as XML now is for PHP5 as a
whole. While the comparison might be a bit strange, it even recalls the
relation between the two: the "standard" encoding for XML data is a multibyte
charset.
But the real problem is that it's really hard for developers outside of
the "multibyte" world to understand the ins and outs of these charsets and
how to handle them correctly. It was a PITA to make the whole Horde
framework charset-independent without knowing anything about mb charsets and
their support in PHP.
I did this due to popular demand, because there are a lot of people
using/needing mb charsets. It would be great if other developers didn't
have to take this steep road, because PHP would support these out of the
box.
While writing this message, Rasmus got my point in fewer words. ;-)
Jan.
--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft
Before we get too far off topic (fgetcsv()), let me just quickly summarize my
position. I think it's great that multi-byte support exists in PHP; we have
mbstring, iconv and recode extensions that all help make PHP work with
multibyte strings.
As with most things in PHP, the user has a choice of whether or not to enable
this functionality depending on their present and future needs. By putting
this functionality into core/standard, this choice is taken away; moreover, it
penalizes (rather significantly) every single user of the affected functionality.
That is what I'd like to avoid.
Without mbstring enabled, you would not be able to work effectively with
multibyte characters. Therefore, even if fgetcsv() worked as you might expect
with multibyte strings, that data would not be manageable in most cases.
Which is why I propose that the multibyte version of fgetcsv() be placed inside
ext/mbstring as mb_fgetcsv().
Ilia
Without mbstring enabled, you would not be able to work effectively with
multibyte characters. Therefore, even if fgetcsv() worked as you might expect
with multibyte strings, that data would not be manageable in most cases.
That's a bogus argument. In what cases do you think the data would have to be
managed with mbstring functions? As PHP's functions are mostly complete in
their roles and don't need any additional operations in most cases, I don't
have to use mbstring functions there.
The string manipulation functions provided by mbstring are atomic, and so
they are used in just the same way as their standard counterparts like
substr() and strpos().
I can see only a few cases where the strings fetched by fgetcsv()
would have to be modified by substr() or whatever.
Moriyoshi
Without mbstring enabled, you would not be able to work effectively with
multibyte characters. Therefore, even if fgetcsv() worked as you might expect
with multibyte strings, that data would not be manageable in most cases.
What if you want to alter the case of the data, use it as a basis for sending
e-mail, or check it against another string via a regular expression? Sure, if
you just want to read the fields and echo them back in the form of a table,
you don't need mbstring.
It also seems to me that if you are going to be handling multibyte inputs you'd
have either the iconv or the mbstring extension enabled.
Ilia
It also seems to me that if you are going to be handling multibyte inputs you'd
have either the iconv or the mbstring extension enabled.
Which input? If you are talking about form inputs, we rarely need that
functionality (mbstring.encoding_conversion).
That is needed when the encoding in which a script is written and
the one the form uses to submit to the script are different.
Such a situation might look weird to you, but it happens when the output
of the script is automagically translated by mb_output_handler().
Please take a look at the relevant documentation.
Moriyoshi
Which input?
If you get multibyte data from a form and want to perform string operations on
it such as strlen(), ereg(), etc., would you not need mb_* or iconv_*
functions?
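For example (assuming the form data arrives as UTF-8):
<?php
// Byte-oriented functions give misleading answers on multibyte input.
$name = "日本語";                      // 3 characters, 9 bytes in UTF-8
var_dump(strlen($name));              // int(9) - counts bytes
var_dump(mb_strlen($name, 'UTF-8'));  // int(3) - counts characters
?>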
Ilia
Which input?
If you get multibyte data from a form and want to perform string operations on
it such as strlen(), ereg(), etc., would you not need mb_* or iconv_*
functions?
In such a case, yes, I would. But I don't think that's directly related
to the issue..?
Moriyoshi
I disagree, for the following reasons:
- Quite a few people actually do use fgetcsv() with multibyte
characters. Regarding this, applications made by those who don't use
such characters don't (and won't) use multibyte-specific
functions, and that's the problem. This greatly prevents
them from being portable.
People have lived without multibyte support in fgetcsv() for many years now,
and I have not seen a single request on bugs.php.net for fgetcsv() multi-byte
support. So, while this is certainly useful functionality, I do not believe it
is as widely needed as you say it is. We also have a multibyte extension that
already implements multi-byte-safe variants of common functions; why make an
exception for fgetcsv() and add multibyte code into the core?
Just an observation: it seems that the PHP users who need multibyte
support are generally self-sufficient by default. It's often hard to
convince programmers to change their code as fundamentally as you often
need to in order to support not just UTF-8 but the whole range of CJK
charsets; it adds complexity and can slow things down. These users are
used to maintaining their own patches for all kinds of software. The
process of merging in multibyte character features often takes several
years.
Because of this (if my observation is correct), you can't really tell,
for example, how many Japanese users are having issues with fgetcsv() by
counting requests on bugs.php.net.
I agree with Moriyoshi Koizumi that performance is not necessarily the
primary factor here. IMHO performance is important, but generality and
reliability are more so.
With all due respect to everyone, I think that we should be a bit more
welcoming to people who offer help in making PHP a better language for
CJK websites. There's still a huge amount of marketshare waiting for
PHP in Asia. :-)
- Stig
If we limited the support to UTF-8 or EUC encoding only, we'd be
able to gain drastically better performance. But that wouldn't
actually solve practical problems in the environments where it is used.
Could iconv stream filters be used to convert the various encodings (if needed)
to UTF-8, thus addressing the problem?
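Something like this, roughly (assuming ext/iconv and its convert.iconv.*
stream filters are available, and that CP932 is a recognised encoding name):
<?php
// Let a stream filter transcode the file to UTF-8 before fgetcsv() sees it.
$fp = fopen('/tmp/data_sjis.csv', 'r');
stream_filter_append($fp, 'convert.iconv.CP932/UTF-8', STREAM_FILTER_READ);
$row = fgetcsv($fp, 4096);   // fgetcsv() now operates on UTF-8 data
fclose($fp);
?>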
Ilia
If we limited the support to UTF-8 or EUC encoding only, we'd be
able to gain drastically better performance. But that wouldn't
actually solve practical problems in the environments where it is used.
Could iconv stream filters be used to convert the various encodings (if needed)
to UTF-8, thus addressing the problem?
Actually it might, but doing so leads to great overhead, because
you have to reconvert the strings to restore the initial form.
Besides, the conversion is sometimes irreversible. And it wouldn't even
make sense unless the iconv extension became built-in.
Moriyoshi
On a related note, I should mention that fgetcsv() in 4.3.X is currently 2.5
times faster than its equivalent in 5.X.
I don't know why you're mentioning this at this time, but I can say it is
a sort of necessary evil :) The HEAD version is capable of handling various
encodings, and is less intricate IMO. If anything, I was surprised by that
result; it's only 2.5 times slower :)
I would call that rather unacceptable actually. Isn't it possible to create
a new function for this which handles this MB 'crap' (and the same for
basename) so that we don't have to lose performance because of those
issues?
regards,
Derick
I would call that rather unacceptable actually. Isn't it possible to create
a new function for this which handles this MB 'crap' (and the same for
basename) so that we don't have to lose performance because of those
issues?
Don't you think "crap" sounds too disgusting and inappropriate?
Please drop such wording here.
Moriyoshi
On a related note, I should mention that fgetcsv() in 4.3.X is currently 2.5
times faster than its equivalent in 5.X.
I don't know why you're mentioning this at this time, but I can say it is
a sort of necessary evil :) The HEAD version is capable of handling various
encodings, and is less intricate IMO. If anything, I was surprised by that
result; it's only 2.5 times slower :)
I would call that rather unacceptable actually. Isn't it possible to create
a new function for this which handles this MB 'crap' (and the same for
basename) so that we don't have to lose performance because of those issues?
"crap" is a poor choice of words, I had no plans to insult you in
anyway. My apologies for that.
Derick
What do you think of this?
I think the new behavior is correct, and FYI, Excel's format is "a,b,c"
and not "a, b, c" anyway.
Derick
I think the new behavior is correct, and FYI, Excel's format is "a,b,c"
and not "a, b, c" anyway.
I mean the output could look like "a, b, c", with the leading spaces.
The old code strips such spaces when they are not inside enclosures,
which I think is somewhat wrong, as you say.
Moriyoshi