Hi,
I just submitted a PR (https://github.com/php/php-src/pull/588) to
allow utf-8 chars to be included in file names that are put into a
phar file.
I thought I'd ask for feedback here, as it would be good to have someone
who understands re2c better than I do check the code. This seems like
such an obvious change that it's surprising it hasn't been done before.
Have I missed an obvious reason why it shouldn't be done?
cheers
Dan
Hi Dan,
> I just submitted a PR (https://github.com/php/php-src/pull/588) to
> allow utf-8 chars to be included in file names that are put into a
> phar file.
I think it would be better to apply Unicode normalization (NFC). Otherwise,
there may be multiple files that appear to have the same file name. To be
thorough, you could detect the platform and convert to NFD where applicable,
e.g. on OSX.
You can use ICU for normalization.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Hi Yasuo,
I'm not sure I understand you. Do you have any example filenames that
I can test against to make sure that different filenames don't 'appear
the same'?
Also, the filenames used in phar files are not exposed to the
underlying system. They are held completely within the PHP phar file
and shouldn't be affected by platform. The restriction on characters
was caused by ext/phar explicitly rejecting utf-8 multibyte
characters.
cheers
Dan
Hi Dan,
On Fri, Feb 14, 2014 at 10:00 AM, Dan Ackroyd danack@basereality.com wrote:
> I'm not sure I understand you. Do you have any example filenames that
> I can test against to make sure that different filenames don't 'appear
> the same'?
You can create filenames that appear the same but have different byte
representations under NFC and NFD normalization. For instance,
"がぎぐげご.txt" will have a different byte pattern in each form, since NFD
decomposes 「が」 into 「か」 and 「゛」, and so on.
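A rough illustration of the byte-pattern difference, using Python's standard `unicodedata` module purely for inspection (this is not code from the PR, and the variable names are made up for the example). The combining mark produced by NFD is U+3099:

```python
import unicodedata

# "がぎぐげご.txt", written with escapes so the literal is unambiguously precomposed.
name = "\u304c\u304e\u3050\u3052\u3054.txt"

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# Under NFD each kana becomes a base letter followed by U+3099.
print([hex(ord(c)) for c in nfd[:2]])

# Same visible name, different UTF-8 byte lengths (19 vs 34).
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))

# The two forms are not equal as strings, but recomposing NFD gives back NFC.
print(nfc == nfd)
print(unicodedata.normalize("NFC", nfd) == nfc)
```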
Windows and Linux seem to use NFC, but that is a coincidence: they just
happen to use the composed form of Unicode, i.e. they don't compose
intentionally.
OSX decomposes intentionally. Decomposed filenames will appear the same
on Windows and Linux, so it is possible to have two files with
semantically the same name.
NOTE: OSX's NFD differs a little from the Unicode standard.
Older versions of Subversion and git did not handle the normalization
difference, and created multiple filenames that appear the same when a
user works on both OSX and Windows/Linux.
> Also, the filenames used in phar files are not exposed to the
> underlying system. They are held completely within the PHP phar file
> and shouldn't be affected by platform. The restriction on characters
> was caused by ext/phar explicitly rejecting utf-8 multibyte
> characters.
I don't use phar much. It's possible to use it as an archive, right?
https://php.net/phar.extractto
I think it's a great change even without normalization. It would be
better if normalization were taken care of.
To handle the normalization difference, you may apply NFC normalization
on OSX.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Dan,
> To handle normalization difference, you may apply NFC normalization
> on OSX.
All you have to do is "detect OSX by #if during build" and "apply NFC
normalization for Unicode filenames using ICU".
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Hi Dan,
> To handle normalization difference, you may apply NFC normalization
> on OSX.
> All you have to do is "detect OSX by #if during build" and "apply NFC
> normalization for Unicode filenames using ICU".
Or even simpler: since NFC normalization doesn't destroy names (to be
exact, it could, but that does not matter under normal circumstances),
you can simply apply NFC always.
I suppose performance is not an issue here.
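The "apply NFC always" idea could look roughly like the sketch below. It is written in Python with the standard `unicodedata` module only to show the logic; `normalize_entry_names` is a hypothetical helper, not anything in ext/phar, and the real change would presumably go through ICU in C. The collision check is an extra assumption on top of the suggestion, so that two spellings of one name are rejected instead of silently merged.

```python
import unicodedata


def normalize_entry_names(names):
    """Return the entry names in NFC, refusing names that collide after composing."""
    seen = {}  # NFC form -> the original name that produced it
    result = []
    for original in names:
        composed = unicodedata.normalize("NFC", original)
        if composed in seen and seen[composed] != original:
            raise ValueError(
                f"{original!r} and {seen[composed]!r} normalize to the same name"
            )
        seen[composed] = original
        result.append(composed)
    return result


# A decomposed name is stored in composed form; plain ASCII is left untouched.
print(normalize_entry_names(["Ame\u0301lie.txt", "readme.txt"]))
```

Normalizing at the point where a name is added would make the stored name independent of the platform that built the archive.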
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Hi Yasuo,
That is not an issue as:
i) Phar files produced on a Windows machine should be identical to
those produced on a Linux or OSX box.
ii) There is a test in the phar code, so that if you do have filenames
that are degenerate after normalising, the extraction throws an error.
e.g. for the files
$filename1 = "Am\xC3\xA9lie.txt";
$filename2 = "Am\x65\xCC\x81lie.txt";
If you add both to a phar archive and then attempt to extract them
both you get the error:
"Cannot extract "Amélie.txt" to "output/Amélie.txt", path already exists"
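For reference, the two names above can be compared outside PHP. This is a small Python sketch using the standard `unicodedata` module, not part of the phar code: the raw strings differ, and they only become equal once a normalization form is applied, for example by a file system that normalizes paths.

```python
import unicodedata

# The same byte sequences as $filename1 and $filename2, decoded as UTF-8.
filename1 = b"Am\xC3\xA9lie.txt".decode("utf-8")      # precomposed U+00E9
filename2 = b"Am\x65\xCC\x81lie.txt".decode("utf-8")  # "e" followed by combining U+0301

# Different code point sequences, so a byte-wise comparison sees two names.
print(filename1 == filename2)

# After normalizing both to NFC they compare equal.
nfc1 = unicodedata.normalize("NFC", filename1)
nfc2 = unicodedata.normalize("NFC", filename2)
print(nfc1 == nfc2)

# filename2 is exactly the NFD form of filename1.
print(unicodedata.normalize("NFD", filename1) == filename2)
```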
cheers
Dan
Hi Dan,
> That is not an issue as:
> i) Phar files produced on a windows machine should be identical to
> those produced on a Linux or OSX box.
> ii) There is a test in the phar code, so that if you do have filenames
> that are degenerate after normalising, the extraction throws an error.
> e.g. for the files
> $filename1 = "Am\xC3\xA9lie.txt";
> $filename2 = "Am\x65\xCC\x81lie.txt";
> If you add both to a phar archive and then attempt to extract them
> both you get the error:
> "Cannot extract "Amélie.txt" to "output/Amélie.txt", path already exists"
I suppose there is no normalization code in phar, so it is your system
(OS / file system) that normalizes the file name.
Depending on the system's normalization is not good:
- A file name could be NFC or NFD
- File names in phar may differ by system
- There are systems that do not actively normalize Unicode
I do see file name normalization issues with git across my Linux/Windows
and OSX machines. (core.precomposeunicode=true is required for correct
operation on OSX.)
I suggest applying NFC normalization to avoid the issue, like git does.
core.precomposeunicode
This option is only used by Mac OS implementation of Git. When
core.precomposeunicode=true, Git reverts the unicode decomposition of
filenames done by Mac OS. This is useful when sharing a repository between
Mac OS and Linux or Windows. (Git for Windows 1.7.10 or higher is needed,
or Git under cygwin 1.7). When false, file names are handled fully
transparent by Git, which is backward compatible with older versions of Git.
http://git-scm.com/docs/git-config
As Rowan pointed out, although ICU is always detected by acinclude.m4, #if
should be used for ICU/intl related code. (intl uses ICU, so using intl
means using ICU. I think it's better not to rely on intl: it may be disabled
or built as a DL module. There are also systems without ICU.)
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Yasuo wrote:
> File names in phar may differ by system
No. No they won't.
They will be exactly as they are specified when they are added by the
user to the Phar archive.
cheers
Dan
Hi Dan,
On Sat, Feb 15, 2014 at 10:07 AM, Dan Ackroyd danack@basereality.com wrote:
>> File names in phar may differ by system
> No. No they won't.
> They will be exactly as they are specified when they are added by the
> user to the Phar archive.
There is a good reason why recent git has the core.precomposeunicode option:
it composes (NFC) Unicode file names.
Please do not ignore this.
Or am I missing something, and names are already composed on all platforms?
I'm assuming current phar does not have a compose feature. Does it?
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Yasuo Ohgaki wrote (on 14/02/2014):
> You can use ICU for normalization.
Making ICU a dependency for PHAR could be pretty onerous. I think I'm
right in saying that only the "intl" extension relies on that at the moment?
I seem to remember Rasmus saying that he saw the requirement to bundle
ICU as a big problem with the now-abandoned "First Unicode
Implementation" (AKA PHP6).
Regards,
Rowan Collins
[IMSoP]