Deprecate declare(encoding='...') + zend.multibyte + zend.script_encoding + zend.detect_unicode ?

1 year ago by Hans Henrik Bergan

With the dominance of UTF-8 (a fixed-endian encoding), surely no new
code should utilize any of declare(encoding='...') / zend.multibyte /
zend.script_encoding / zend.detect_unicode.
I propose we deprecate all 4.

1 year ago by Claude Pache

On 28 Nov 2023 at 19:57, Hans Henrik Bergan divinity76@gmail.com wrote:

With the dominance of UTF-8 (a fixed-endian encoding), surely no new
code should utilize any of declare(encoding='...') / zend.multibyte /
zend.script_encoding / zend.detect_unicode.
I propose we deprecate all 4.

Hi,

What is the migration path for legacy code that uses those directives?

—Claude

1 year ago by Kamil Tekiela

Hi Hans,

Can you share a few more details about how this works? This is a
pretty niche feature, so most people probably don't know what it
is, how it works, or why it should no longer be used. Also, as Claude
mentioned, what is the preferred alternative?

Regards,
Kamil

1 year ago by Dusk

What is the migration path for legacy code that use those directives?

Convert your PHP source files to UTF-8. These directives are only required for code written in legacy multibyte encodings like Shift-JIS, Big5, or EUC-CN. (These encodings are primarily used for Chinese and Japanese text.)

These directives are not required for scripts which process text in these encodings. They're only required if the source code itself is in a legacy multibyte encoding, as those encodings can contain octets in the basic ASCII range (0x20 - 0x7f) within multibyte sequences. For example, the character "ボ" (U+30DC KATAKANA LETTER BO) is encoded in Shift-JIS as 83 7B, whose second octet would ordinarily represent the ASCII character "{". If this character appeared in a variable name, for instance, PHP would need to recognize that the "7B" does not represent open brace.
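
To make the "ボ" example concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and is illustrative only, not taken from php-src. The character is written as the escape "\u{30DC}" so the snippet itself is pure ASCII and behaves the same whatever encoding the file is saved in:

```php
<?php
// Assumption: ext/mbstring is available. "\u{30DC}" yields the UTF-8 bytes of U+30DC.
$bo   = "\u{30DC}";
$sjis = mb_convert_encoding($bo, 'SJIS', 'UTF-8');

echo bin2hex($sjis), "\n";   // 837b
echo $sjis[1], "\n";         // the trail byte on its own is the ASCII character "{"
```

A lexer that is not told the file is Shift-JIS has no way to know that this 0x7B belongs to the preceding 0x83.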

With the dominance of UTF-8 (a fixed-endian encoding)

I'll add that what's special about UTF-8 isn't that it's "fixed-endian". It's that UTF-8 only uses octets above 0x7F for characters outside the ASCII range, so the parser doesn't have to be specifically aware of UTF-8 encoding when processing text.
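
The property described above can be checked byte by byte. This is only a sketch: the helper name is made up for illustration, and it needs no extension because it looks at raw bytes only:

```php
<?php
// Hypothetical helper: returns the bytes of $s that fall in the ASCII range (< 0x80), as hex.
function ascii_range_bytes(string $s): array
{
    $hits = [];
    foreach (str_split($s) as $byte) {
        if (ord($byte) < 0x80) {
            $hits[] = bin2hex($byte);
        }
    }
    return $hits;
}

// U+30DC in UTF-8 is e3 83 9c: every byte is 0x80 or above.
var_dump(ascii_range_bytes("\u{30DC}"));   // empty array
// The same character in Shift-JIS is 83 7b: the trail byte collides with "{".
var_dump(ascii_range_bytes("\x83\x7B"));   // one entry, "7b"
```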

1 year ago by Kamil Tekiela

Convert your PHP source files to UTF-8.

If the solution is as easy as just converting the encoding of the
source file, then why did we even need to have this setting at all?
Why did the PHP parser support encodings that demanded the introduction of
this declare?

1 year ago by Claude Pache

On 28 Nov 2023 at 20:56, Kamil Tekiela tekiela246@gmail.com wrote:

Convert your PHP source files to UTF-8.

If the solution is as easy as just converting the encoding of the
source file, then why did we even need to have this setting at all?
Why did PHP parser support encodings that demanded the introduction of
this declare?

It is not necessarily that simple: your code base may contain literal strings, and changing the encoding of the source file will effectively change the contents of those strings.
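
A small sketch of that effect, assuming the mbstring extension is loaded. The bytes are written as escapes so the snippet does not depend on how the file itself is saved:

```php
<?php
// Assumption: ext/mbstring is available.
// The byte a Latin-1 source file holds for the literal "é":
$latin1 = "\xE9";
// The bytes the same literal holds once the file has been re-saved as UTF-8:
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump(strlen($latin1));   // int(1)
var_dump(strlen($utf8));     // int(2)
var_dump(bin2hex($utf8));    // string(4) "c3a9"
```

Anything that depends on the raw bytes of the literal (lengths, hashes, comparisons against stored data) sees a different value after the conversion.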

—Claude

1 year ago by Hans Henrik Bergan

What is the migration path for legacy code that use those directives?

The migration path is to convert the legacy-encoding PHP files to UTF-8.
Luckily this can be largely automated, here is my attempt:
https://github.com/divinity76/php2utf8/blob/main/src/php2utf8.php
but that code definitely needs some proof-reading and additions - idk
if the approach used is even a good approach, it was just the first i
could think of, feel free to write one from scratch

Can you share a little more details about how this works?

I hope someone else can do that, but it allows PHP to parse and
execute scripts that are not written in UTF-8, as well as scripts that
use a BOM (byte order mark).

add that what's special about UTF-8 isn't that it's "fixed-endian".

One of several good things about UTF-8 is that it's fixed-endian, so
UTF-8 doesn't need a BOM to specify endianness (unlike UTF-16 and UTF-32,
which are bi-endian, where a BOM helps identify the endianness used).

If the solution is as easy as just converting the encoding of the
source file, then why did we even need to have this setting at all?
Why did PHP parser support encodings that demanded the introduction of

I've read your question but don't have an answer to it, hopefully
someone else knows.

1 year ago by Hans Henrik Bergan

btw, if we come to some consensus that my php2utf8.php script is actually
worthwhile to expand on, I can volunteer to add more encodings (SJIS,
BIG5, anything supported by mbstring),
but it wouldn't surprise me if a better approach exists and the script
should be rewritten entirely~

add that what's special about UTF-8 isn't that it's "fixed-endian".

should've added this to the last post, but the "zend.detect_unicode"
ini-option is specifically to scan for BOMs, and BOMs are
significantly less useful in fixed-endian encodings (like UTF8) than
bi-endian encodings (like UTF16/UTF32) ^^

1 year ago by youkidearitai

On Wed, 29 Nov 2023 at 7:41, Hans Henrik Bergan divinity76@gmail.com wrote:

btw if we come to some consensus to my php2utf8.php script is actually
worthwhile to expand on, i can volunteer to add more encodings (SJIS,
BIG5, anything supported by mbstring),
but it wouldn't surprise me if a better approach exist and the script
should be rewritten entirely~

--

To unsubscribe, visit: https://www.php.net/unsub.php

Hi, Hans

Does this convert PHP code from any encoding to UTF-8?
If so, that is very difficult, because PHP code is written in many
different character encodings and is not necessarily in UTF-8.

There are many character encodings in use around the world.
It will be difficult to unify PHP code on a single one.

Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by Hans Henrik Bergan

@youkidearitai right now the code specifically deals with

UTF8: removing the UTF8 BOM and removing declare(encoding='UTF-8');
UTF16LE/UTF16BE/UTF32LE/UTF32BE: converting to UTF8, removing the BOM
and removing declare(encoding='...')
ISO-8859-1: converting to UTF-8 and removing
declare(encoding='ISO-8859-1'). I couldn't find any information on
an ISO-8859-1 BOM, so to the best of my knowledge it does not exist

it does not deal with any other encodings as of writing, but more can
be added if needed.

1 year ago by youkidearitai

On Wed, 29 Nov 2023 at 8:07, Hans Henrik Bergan divinity76@gmail.com wrote:

@youkidearitai right now the code specifically deals with

UTF8: removing UTF8 BOM and removing declare(encoding='UTF-8');

UTF16LE/UTF16BE/UTF32LE/UTF32BE: converting to UTF8 removing the BOM
and removing declare(encoding='...')

ISO-8859-1: converting to UTF-8 and removing
declare(encoding='ISO-8859-1'), i couldn't really find information on
a ISO-8859-1 BOM, so to the best of my knowledge it does not exist

it does not deal with any other encodings as of writing, but more can
be added if needed.

Hi, Hans

I see. I understand the argument.
At least, code in Japanese character encodings does not seem to use declare(encoding=...).

We probably rely on zend_encoding implicitly.
If zend_encoding is removed, the 5c problem will probably occur in SJIS (Shift_JIS).

For example:

$val = "表"; // 表 is 0x955c; the script sees 0x5c22 (an escaped quote), so a parse error is thrown

For more about the 5c problem, please see https://blog.kano.ac/archive/posts/1654_5c-problem/

I would like to maintain backwards compatibility. It seems that
zend_encoding can't be removed.
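
The bytes behind the 5c problem can be inspected with a short sketch. It assumes the mbstring extension is loaded, and writes 表 as the escape "\u{8868}" so the snippet itself stays ASCII:

```php
<?php
// Assumption: ext/mbstring is available. U+8868 is the character 表.
$sjis = mb_convert_encoding("\u{8868}", 'SJIS', 'UTF-8');

echo bin2hex($sjis), "\n";      // 955c
var_dump($sjis[1] === '\\');    // bool(true): the trail byte is a backslash
```

In a Shift_JIS source file the literal "表" is therefore the byte sequence 22 95 5c 22. A lexer that is unaware of the encoding reads 5c 22 as an escaped double quote, so the string never terminates where the author intended.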

Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by Hans Henrik Bergan

Do you have access to a project actually using Shift_JIS? Interesting!
I thought they were practically unicorns: either non-existent or still running PHP 4.

Can you run

var_dump(array(
    "biao_hex" => bin2hex("表"),
    "zend.multibyte" => ini_get("zend.multibyte"),
    "zend.script_encoding" => ini_get("zend.script_encoding"),
    "zend.detect_unicode" => ini_get("zend.detect_unicode"),
    "mbstring.internal_encoding" => ini_get("mbstring.internal_encoding"),
    "mbstring.func_overload" => ini_get("mbstring.func_overload"),
    "PHP_VERSION" => PHP_VERSION,
));

there? What do you get?

1 year ago by Hans Henrik Bergan

actually scratch that, run

var_dump(array(
    "biao_hex" => bin2hex("表"),
    "zend.multibyte" => ini_get("zend.multibyte"),
    "zend.script_encoding" => ini_get("zend.script_encoding"),
    "zend.detect_unicode" => ini_get("zend.detect_unicode"),
    "mbstring.internal_encoding" => ini_get("mbstring.internal_encoding"),
    "mbstring.func_overload" => ini_get("mbstring.func_overload"),
    "PHP_VERSION" => PHP_VERSION,
    "raw_script_bytes" => bin2hex(file_get_contents(__FILE__)),
));

what do you get?

1 year ago by youkidearitai

On Wed, 29 Nov 2023 at 9:04, Hans Henrik Bergan divinity76@gmail.com wrote:

Do you have access to a project actually using Shift_JIS? Interesting!
I thought they were practically unicorns / non-existent running PHP4,

Can you run

var_dump(array(
    "biao_hex" => bin2hex("表"),
    "zend.multibyte" => ini_get("zend.multibyte"),
    "zend.script_encoding" => ini_get("zend.script_encoding"),
    "zend.detect_unicode" => ini_get("zend.detect_unicode"),
    "mbstring.internal_encoding" => ini_get("mbstring.internal_encoding"),
    "mbstring.func_overload" => ini_get("mbstring.func_overload"),
    "PHP_VERSION" => PHP_VERSION,
));

Hi, Hans

I tried the code above.

With no configuration:
❯ ~/php82/bin/php deprecate_zend_scriptencoding.php
PHP Parse error: syntax error, unexpected identifier "zend", expecting ")" in /Users/youkidearitai/deprecate_zend_scriptencoding.php on line 5

Parse error: syntax error, unexpected identifier "zend", expecting ")" in /Users/youkidearitai/deprecate_zend_scriptencoding.php on line 5

With zend.script_encoding=sjis and zend.multibyte=true:

❯ ~/php82/bin/php -d zend.script_encoding=sjis -d zend.multibyte=true deprecate_zend_scriptencoding.php
array(7) {
["biao_hex"]=>
string(6) "e8a1a8"
["zend.multibyte"]=>
string(1) "1"
["zend.script_encoding"]=>
string(4) "sjis"
["zend.detect_unicode"]=>
string(1) "1"
["mbstring.internal_encoding"]=>
string(0) ""
["mbstring.func_overload"]=>
bool(false)
["PHP_VERSION"]=>
string(5) "8.2.8"
}

Therefore, zend.script_encoding and zend.multibyte are very important.

Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by youkidearitai

Strictly speaking, internal_encoding should be included as well:

❯ ~/php82/bin/php -d zend.script_encoding=sjis -d internal_encoding=sjis -d zend.multibyte=true deprecate_zend_scriptencoding.php
array(7) {
["biao_hex"]=>
string(4) "955c"
["zend.multibyte"]=>
string(1) "1"
["zend.script_encoding"]=>
string(4) "sjis"
["zend.detect_unicode"]=>
string(1) "1"
["mbstring.internal_encoding"]=>
string(0) ""
["mbstring.func_overload"]=>
bool(false)
["PHP_VERSION"]=>
string(5) "8.2.8"
}

--

Yuya Hamada (tekimen)

1 year ago by Hans Henrik Bergan

i think Shift_JIS can also be automatically converted to UTF-8, does
this seem right?
https://github.com/divinity76/php2utf8/commit/6e08c4c16312961170cce821195816a8d24e23f6

1 year ago by youkidearitai

On Wed, 29 Nov 2023 at 20:42, Hans Henrik Bergan divinity76@gmail.com wrote:

i think Shift_JIS can also be automatically converted to UTF-8, does
this seem right?
https://github.com/divinity76/php2utf8/commit/6e08c4c16312961170cce821195816a8d24e23f6

Sorry if this sounds harsh, but that is not right.
Shift_JIS is very ambiguous. What do we do when SJIS-2004 or SJIS-win comes along?
How do we guess (detect) SJIS-2004, SJIS-win and SJIS-mac?

<?php
// Comparison table from https://uic.io/en/charset/compare/shiftjis2004/cp932/
var_dump("\xfc\x40"); // What is 0xFC40, 騱(SJIS-2004) or 髜(SJIS-win)?
?>

In the first place, we should not change the character encoding of PHP scripts.
In addition to this, we have to think about various other things.
This is not just a Japanese problem.

--

Yuya Hamada (tekimen)

1 year ago by youkidearitai

On Wed, 29 Nov 2023 at 21:16, youkidearitai youkidearitai@gmail.com wrote:

Shift_JIS is very ambiguous, What will we do if SJIS-2004 or SJIS-win comes?
How do we guess(detect) SJIS-2004, SJIS-win and SJIS-mac?

I'm sorry if I caused offense, and for posting again.
The problem is easy to understand.
How do we detect which of the ISO-8859 series is in use?

<?php
var_dump("\xca"); // Which character is this?
?>
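
To spell the question out, the sketch below decodes that same byte under three members of the ISO-8859 family. It assumes the mbstring extension is loaded with these encodings available; the choice of the three encodings is only for illustration:

```php
<?php
// Assumption: ext/mbstring is available, with these three encodings compiled in.
$byte = "\xCA";
foreach (['ISO-8859-1', 'ISO-8859-5', 'ISO-8859-7'] as $encoding) {
    $utf8 = mb_convert_encoding($byte, 'UTF-8', $encoding);
    printf("%-10s -> %s\n", $encoding, bin2hex($utf8));
}
// ISO-8859-1 -> c38a (U+00CA, Ê)
// ISO-8859-5 -> d0aa (U+042A, Ъ)
// ISO-8859-7 -> ce9a (U+039A, Κ)
```

All three results are valid text, so nothing in the byte itself says which reading the author meant.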

Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by Ayesh Karunaratne

Shift_JIS is very ambiguous, What will we do if SJIS-2004 or SJIS-win comes?
How do we guess(detect) SJIS-2004, SJIS-win and SJIS-mac?

I'm not the person you replied to in your previous email, but I
thought I would weigh in with what I can. My native language also uses
multiple bytes per character, and I have done a fair bit of character
encoding conversion from one encoding to another.

The very reason why we have character encoding sets is to be able to
reassign the same byte values to multiple real-life characters, so
changing the character encodings from a non-UTF charset always carries
some sort of "risk" of detecting the wrong source text encoding. Like
Yuya Hamada mentioned in the rest of the previous email, 0xFC40 for
example can map to two different characters. These are quite common
occurrences, and there is even a word (Mojibake) for it!

The most robust projects in this space are probably enca and
Chardet (Python). However, in principle, all tools can only guess
the text encoding by inspecting common patterns and by checking whether all
bytes map to a meaningful glyph. When there is not a lot of text to
inspect, these tools are very prone to wrong results.

When the source encoding is correctly detected or known, it's easy to
re-encode files using iconv, followed by a quick sed to remove the
declare() calls.
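
For readers who would rather stay in PHP than use iconv and sed, the same two steps might be sketched as follows. The function name is hypothetical (it is not part of PHP or of php2utf8), the mbstring extension is assumed, and the pattern is deliberately naive: it only handles a declare(encoding=...) statement that sits on a line of its own.

```php
<?php
// Hypothetical helper: re-encodes source text whose encoding is already known,
// then drops any declare(encoding='...') statement. Assumption: ext/mbstring is available.
function to_utf8_source(string $source, string $fromEncoding): string
{
    $utf8 = mb_convert_encoding($source, 'UTF-8', $fromEncoding);
    // Naive pattern: a declare(encoding='...') statement on its own line, plus its newline.
    $pattern = '/^[ \t]*declare\s*\(\s*encoding\s*=\s*([\'"])[^\'"]*\1\s*\)\s*;[ \t]*\R?/mi';
    // preg_replace() returns null on error; fall back to the converted text in that case.
    return preg_replace($pattern, '', $utf8) ?? $utf8;
}

$legacy = "<?php\ndeclare(encoding='ISO-8859-1');\necho \"\xE9\";\n";
echo bin2hex(to_utf8_source($legacy, 'ISO-8859-1')), "\n";
```

This covers only the source file itself, and only when the source encoding is known for certain.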

That said, I'm hugely in favor of dropping support for non-UTF8
encodings. Because the source encoding is present in the INI settings
or the declare statement, the site owners should be able to
mass-encode text to UTF-8. Many languages like Rust only support UTF-8
(https://doc.rust-lang.org/reference/input-format.html), and I don't
think any new PHP developers will expect PHP to work with non-UTF8
encodings in the first place.

1 year ago by Mark Trapp

On Tue, Nov 28, 2023 at 12:48 PM Hans Henrik Bergan
divinity76@gmail.com wrote:

If the solution is as easy as just converting the encoding of the
source file, then why did we even need to have this setting at all?
Why did PHP parser support encodings that demanded the introduction of

I've read your question but don't have an answer to it, hopefully
someone else knows.

These settings predate the ubiquity of UTF-8, which did not begin to
see widespread adoption until the mid-to-late 2000s, and did not reach
ubiquity until the mid-2010s:
https://en.wikipedia.org/wiki/Popularity_of_text_encodings

mbstring.script_encoding was introduced with this commit and released
in PHP 4.3 (renamed to zend.script_encoding in PHP 5.4):
https://github.com/php/php-src/commit/f30b722f14521fbad2fabe5fdcaa2b60fe97eebb

zend.detect_unicode was introduced in this commit and released with PHP 5.1:
https://github.com/php/php-src/commit/a8c6b992b8894763c59276c1142971aa9a314500

zend.multibyte was introduced with this commit and released with PHP 5.4:
https://github.com/php/php-src/commit/ab93d8c621645e05d6a6a431d52ac64eda956673

declare(encoding) appears to predate all of the PHP 4.0 tagged
releases, including the pre-release ones.

Mark Trapp

1 year ago by Claude Pache

On 28 Nov 2023 at 21:47, Hans Henrik Bergan divinity76@gmail.com wrote:

What is the migration path for legacy code that use those directives?

The migration path is to convert the legacy-encoding PHP files to UTF-8.
Luckily this can be largely automated, here is my attempt:
https://github.com/divinity76/php2utf8/blob/main/src/php2utf8.php
but that code definitely needs some proof-reading and additions - idk
if the approach used is even a good approach, it was just the first i
could think of, feel free to write one from scratch

Hi,

Converting the character encoding of php files is by no means sufficient, except in the simplest cases.

Strings of text are to be found in various places, such as:

(1) in the php files, as literals;
(2) inside memory, at runtime;
(3) in non-php data files stored on the server;
(4) in the database;
(5) as presented to the user (e.g. html document) and as received from them (e.g. form submission);
(6) etc.

If you change the character encoding in (1), you necessarily change the encoding in (2), unless you wrap your literals with some function that performs the conversion in the other direction at runtime. And if you change the encoding in (2), you should be very careful when your text flows from and to (3), (4), (5) and (6): you should either change the encoding at those places, or make sure that proper conversion is done at the boundaries of those domains.
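
The option of wrapping literals so that they are converted back at runtime could look roughly like this. It is a sketch only: the function name is invented for illustration, the mbstring extension is assumed, and Windows-1252 stands in for whatever legacy encoding the rest of the application still expects.

```php
<?php
// Hypothetical helper. Assumptions: ext/mbstring is available, and the rest of the
// application still expects Windows-1252 bytes at runtime.
function legacy(string $utf8Literal): string
{
    return mb_convert_encoding($utf8Literal, 'Windows-1252', 'UTF-8');
}

// The file is now UTF-8, so this literal holds the two bytes c3 a9 ...
$literal = "\u{E9}";
// ... but the value handed to the rest of the code is the old single byte e9.
echo bin2hex(legacy($literal)), "\n";   // e9
```

The wrapper keeps the in-memory encoding unchanged, so the other places where text lives can be migrated separately.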

Also, mechanical conversion is not the whole story. For example, if you change the encoding in (5), you should not forget to adapt the <meta charset> tag and/or the content-type http header.

Also, all strings are not text, and only a human can decide whether the literal “\xe9” in a random location is meant to encode the raw byte 0xE9 or the character “é” in latin-1.

Of course, because we live in an interesting world, there will be situations where the encoding is unknown or ambiguous. Yuya mentioned the case of Shift-JIS which has various incompatible variants, and I am happy not to have encountered such ambiguities (only unknownnesses) when I converted our code base from windows-1252 (aka latin-1) to utf-8 a few years ago.

—Claude

1 year ago by youkidearitai

Many languages like Rust only support UTF-8
(https://doc.rust-lang.org/reference/input-format.html), and I don't
think any new PHP developers will expect PHP to work with non-UTF8
encodings in the first place.

Hi,

PSR-1 requires the use of UTF-8.

https://www.php-fig.org/psr/psr-1/
Files MUST use only UTF-8 without BOM for PHP code.

Also, Rust is much newer than PHP, which has a very long history.
If we compare PHP with other languages, we should compare it with
languages of about the same age: Java, Perl, Ruby, Python, etc.
(Java's default encoding is UTF-16.)

Therefore, I think we should stay with PSR-1's "MUST use only UTF-8 without BOM".

Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by Robert Landers

PSR also says that code should use spaces instead of tabs. Should PHP
stop parsing code that uses tabs instead of spaces?

I don't think that PSR has any relevance to this conversation because
it is too opinionated. Sometimes that opinion helps, and sometimes
it gets in the way and stifles innovation and creativity.

Robert Landers
Software Engineer
Utrecht NL

1 year ago by youkidearitai

Hi,

PSR also says that code should use spaces instead of tabs. Should PHP
stop parsing code that uses tabs instead of spaces?

I don't think that PSR has any relevance to this conversation because
it is too opinionated. Sometimes that opinion helps, and sometimes
it gets in the way and stifles innovation and creativity.

That point is not essential.
I only want to say that we should not deprecate zend.script_encoding.
And I am only talking about character encodings. Please don't miss the point.

Regards
Yuya

--

Yuya Hamada (tekimen)

1 year ago by Kentaro Takeda via internals

The migration path is to convert the legacy-encoding PHP files to UTF-8.

Please take a look at the following code.
This is a part of the code that I am actually maintaining in the
latest version of php.

<?php
pg_connect(/* omission */);

// The database server expects clients to perform queries in SJIS.
// Depending on the settings, it may not be necessary to specify it explicitly.
pg_set_client_encoding('SJIS');

$res = pg_query('select * from 表');

Unfortunately, this code breaks if I simply convert it to UTF-8.
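
Why the query breaks can be seen from the bytes alone. The sketch below involves no database and assumes only that the mbstring extension is loaded; the variable names are illustrative:

```php
<?php
// Assumption: ext/mbstring is available. No database is involved; this only compares bytes.
$table  = "\u{8868}";                                      // the table name, 表
$asUtf8 = 'select * from ' . $table;                       // bytes a UTF-8 source file would send
$asSjis = mb_convert_encoding($asUtf8, 'SJIS', 'UTF-8');   // bytes the SJIS source file sends today

echo bin2hex(substr($asUtf8, -3)), "\n";   // e8a1a8
echo bin2hex(substr($asSjis, -2)), "\n";   // 955c
// Reading the UTF-8 bytes as SJIS, as the other side would, does not give back the table name.
var_dump(mb_convert_encoding($asUtf8, 'UTF-8', 'SJIS') === $asUtf8);   // bool(false)
```

So after a plain conversion of the file, the string would also have to be converted at the database boundary, or the connection settings changed to match.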

In the "Usage statistics of character encodings for websites"
published by W3Techs, it is true that encodings other than UTF-8 are
rarely used. However, this covers only what can be
observed from the outside as a website.

As the code above shows, PHP covers a much wider area. In addition to
external connections, for example, SimpleXML and DOMDocument also
handle character encodings internally, so they can break by the
same logic as in the example above.

As Yuya says, the conversion itself is difficult, and even if you can
convert it, it may not be enough, so as a php user from a culture that
uses multi-byte characters, please be aware of this.

Kentaro Takeda