utf-8 filenames in phar files.

11 years ago by Dan Ackroyd

Hi,

I just submitted a PR (https://github.com/php/php-src/pull/588) to
allow utf-8 chars to be included in file names that are put into a
phar file.

I thought I'd ask for feedback here, as it would be good to have someone
who understands re2c better than I do check the code. This seems like
such an obvious change that it's surprising it hasn't been done before,
i.e. have I missed an obvious reason why it shouldn't be done?

cheers
Dan

11 years ago by Yasuo Ohgaki

Hi Dan,

> I just submitted a PR (https://github.com/php/php-src/pull/588) to
> allow utf-8 chars to be included in file names that are put into a
> phar file.

I think it would be better to have Unicode normalization (NFC). Otherwise,
there may be multiple files that seem to have the same file name. To be
thorough, you could detect the platform and convert to NFD where it is
applicable, e.g. OSX. You can use ICU for normalization.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Dan Ackroyd

Hi Yasuo,

I'm not sure I understand you. Do you have any example filenames that
I can test against to make sure that different filenames don't 'appear
the same'?

Also, the filenames used in phar files are not exposed to the
underlying system. They are held completely within the PHP phar file
and shouldn't be affected by platform. The restriction on characters
was caused by ext/phar explicitly rejecting utf-8 multibyte
characters.

cheers
Dan

11 years ago by Yasuo Ohgaki

Hi Dan,

On Fri, Feb 14, 2014 at 10:00 AM, Dan Ackroyd danack@basereality.com wrote:

> I'm not sure I understand you. Do you have any example filenames that
> I can test against to make sure that different filenames don't 'appear
> the same'?

You can create filenames that appear the same but have different byte
representations under NFC and NFD normalization. For instance,
"がぎぐげご.txt" will have a different byte pattern, since NFD decomposes
「が」 into 「か」 and 「゛」 (the combining voiced sound mark, U+3099), and so on.

Windows and Linux seem to use NFC, but that is a coincidence, as they
just happen to use only the composed form of Unicode, i.e. they don't
compose intentionally.

OSX decomposes intentionally. Decomposed filenames will appear the
same on Windows and Linux, so it is possible to have 2 files with the
same name semantically.
NOTE: OSX's NFD differs a little from the Unicode standard.

Older subversion/git didn't take care of the normalization difference and
created multiple filenames that appear the same when a user works on both
OSX and Windows/Linux.

> Also, the filenames used in phar files are not exposed to the
> underlying system. They are held completely within the PHP phar file
> and shouldn't be affected by platform. The restriction on characters
> was caused by ext/phar explicitly rejecting utf-8 multibyte
> characters.

I don't use phar much. It's possible to use it as an archive, right?
https://php.net/phar.extractto

I think it's a great change even without normalization. It would be better
if normalization were taken care of.

To handle the normalization difference, you may apply NFC normalization
on OSX.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Yasuo Ohgaki

Dan,

> To handle normalization difference, you may apply NFC normalization
> on OSX.

All you have to do is "detect OSX by #if during build" and "apply NFC
normalization for Unicode filenames using ICU".

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Yasuo Ohgaki

Hi Dan,

> To handle normalization difference, you may apply NFC normalization
> on OSX.
>
> All you have to do is "detect OSX by #if during build" and "apply NFC
> normalization for Unicode filenames using ICU".

Or even simpler: NFC normalization doesn't destroy names (to be exact, it
could, but that does not matter under normal circumstances), so you can
simply apply NFC always. I suppose performance is not an issue here.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net
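
To make the "compose before storing" suggestion above concrete, here is a minimal C sketch. It is neither ICU nor ext/phar code: `toy_compose`, `toy_table` and `SKETCH_TARGET_DECOMPOSES` are hypothetical names, and the table covers only two hand-picked pairs (e + U+0301, and か + U+3099), whereas real NFC needs the full Unicode composition data that a library such as ICU provides.

```c
#include <stddef.h>
#include <string.h>

/* Build-time detection in the spirit of "detect OSX by #if".
 * SKETCH_TARGET_DECOMPOSES is a hypothetical macro name. */
#if defined(__APPLE__)
#define SKETCH_TARGET_DECOMPOSES 1
#else
#define SKETCH_TARGET_DECOMPOSES 0
#endif

struct toy_pair {
    const char *decomposed; /* base character followed by a combining mark */
    const char *composed;   /* the single precomposed character */
};

/* Toy stand-in for real composition data: two pairs only. */
static const struct toy_pair toy_table[] = {
    { "e\xCC\x81", "\xC3\xA9" },                    /* e + U+0301 -> U+00E9 */
    { "\xE3\x81\x8B\xE3\x82\x99", "\xE3\x81\x8C" }, /* U+304B + U+3099 -> U+304C */
};

/* Copy `in` to `out`, replacing every decomposed pair from toy_table by its
 * composed form. Returns the number of bytes written (without the NUL),
 * or (size_t)-1 if `out` is too small. */
size_t toy_compose(const char *in, char *out, size_t out_size)
{
    size_t written = 0;

    if (out_size == 0) {
        return (size_t)-1;
    }
    while (*in != '\0') {
        const char *chunk = in; /* default: copy one byte unchanged */
        size_t chunk_len = 1;
        size_t consumed = 1;

        for (size_t t = 0; t < sizeof toy_table / sizeof toy_table[0]; t++) {
            size_t n = strlen(toy_table[t].decomposed);
            if (strncmp(in, toy_table[t].decomposed, n) == 0) {
                chunk = toy_table[t].composed;
                chunk_len = strlen(chunk);
                consumed = n;
                break;
            }
        }
        if (written + chunk_len >= out_size) {
            return (size_t)-1; /* no room for the chunk plus the NUL */
        }
        memcpy(out + written, chunk, chunk_len);
        written += chunk_len;
        in += consumed;
    }
    out[written] = '\0';
    return written;
}
```

A caller could guard the call with `#if SKETCH_TARGET_DECOMPOSES` to compose only on OSX builds, or call it unconditionally, which is the simpler variant proposed in the message above. Already-composed names pass through unchanged, so applying it everywhere is harmless for the pairs in the table.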

11 years ago by Dan Ackroyd

Hi Yasuo,

That is not an issue as:

i) Phar files produced on a Windows machine should be identical to
those produced on a Linux or OSX box.

ii) There is a test in the phar code, so that if you do have filenames
that are degenerate after normalising, the extraction throws an error.
e.g. for the files

$filename1 = "Am\xC3\xA9lie.txt";
$filename2 = "Am\x65\xCC\x81lie.txt";

If you add both to a phar archive and then attempt to extract them
both you get the error:

"Cannot extract "Amélie.txt" to "output/Amélie.txt", path already exists"

cheers
Dan
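
The two filenames in the message above render identically but are different byte strings, which is why an archive can hold both. A minimal C sketch of that difference follows; the literals are copied from the message, while `hex_bytes`, `PRECOMPOSED_NAME` and `DECOMPOSED_NAME` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* U+00E9 (precomposed e-acute) is the two bytes C3 A9. */
const char PRECOMPOSED_NAME[] = "Am\xC3\xA9lie.txt";
/* 'e' (0x65) followed by U+0301 (combining acute accent, bytes CC 81). */
const char DECOMPOSED_NAME[] = "Am\x65\xCC\x81lie.txt";

/* Write the bytes of `s` into `out` as space-separated uppercase hex.
 * Output is truncated at a whole byte if `out` is too small. */
void hex_bytes(const char *s, char *out, size_t out_size)
{
    size_t pos = 0;

    if (out_size == 0) {
        return;
    }
    out[0] = '\0';
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        int n = snprintf(out + pos, out_size - pos,
                         pos ? " %02X" : "%02X", (unsigned)*p);
        if (n < 0 || (size_t)n >= out_size - pos) {
            out[pos] = '\0'; /* drop the partially written item */
            break;
        }
        pos += (size_t)n;
    }
}
```

Comparing the names with `strcmp` reports them as different, and dumping them with `hex_bytes` shows where: the first name carries `C3 A9` where the second carries `65 CC 81`. Whether they collide on extraction therefore depends on what the target file system does with the names, not on the archive.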

11 years ago by Yasuo Ohgaki

Hi Dan,

> That is not an issue as:
>
> i) Phar files produced on a windows machine should be identical to
> those produced on a Linux or OSX box.
>
> ii) There is a test in the phar code, so that if you do have filenames
> that are degenerate after normalising, the extraction throws an error.
> e.g. for the files
> $filename1 = "Am\xC3\xA9lie.txt";
> $filename2 = "Am\x65\xCC\x81lie.txt";
> If you add both to a phar archive and then attempt to extract them
> both you get the error:
> "Cannot extract "Amélie.txt" to "output/Amélie.txt", path already
> exists"

I suppose there is no normalization code in phar, so it is your system (OS /
file system) that normalizes the file name.

Depending on the system's normalization is not good:

- File names could be NFC or NFD
- File names in phar may differ by system
- Systems exist that do not actively normalize Unicode

I do see file name normalization issues between my Linux/Windows and OSX
machines with git (core.precomposeunicode=true is required for correct
operation on OSX). I suggest applying NFC normalization to avoid the issue,
like git does.

core.precomposeunicode
This option is only used by Mac OS implementation of Git. When
core.precomposeunicode=true, Git reverts the unicode decomposition of
filenames done by Mac OS. This is useful when sharing a repository between
Mac OS and Linux or Windows. (Git for Windows 1.7.10 or higher is needed,
or Git under cygwin 1.7). When false, file names are handled fully
transparent by Git, which is backward compatible with older versions of Git.
http://git-scm.com/docs/git-config

As Rowan pointed out, although ICU is always detected by acinclude.m4, #if
should be used for ICU/intl related code. (intl uses ICU, so using intl
means using ICU. I think it's better not to rely on intl: it may be disabled
or built as a DL module, and there are also systems without ICU.)

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Dan Ackroyd

Yasuo wrote:

> File names in phar may differ by system

No. No they won't.

They will be exactly as they are specified when they are added by the
user to the Phar archive.

cheers
Dan

11 years ago by Yasuo Ohgaki

Hi Dan,

On Sat, Feb 15, 2014 at 10:07 AM, Dan Ackroyd danack@basereality.com wrote:

> > File names in phar may differ by system
>
> No. No they won't.
>
> They will be exactly as they are specified when they are added by the
> user to the Phar archive.

There is a good reason why recent git has the core.precomposeunicode option:
it composes (NFC) Unicode file names. Please do not ignore this.

Or am I missing something, and names are already composed on all platforms?
I'm assuming current phar does not have a compose feature, does it?

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Rowan Collins

Yasuo Ohgaki wrote (on 14/02/2014):

> You can use ICU for normalization.

Making ICU a dependency for PHAR could be pretty onerous. I think I'm
right in saying that only the "intl" extension relies on that at the moment?

I seem to remember Rasmus saying that he saw the requirement to bundle
ICU as a big problem with the now-abandoned "First Unicode
Implementation" (AKA PHP6).

Regards,

Rowan Collins
[IMSoP]

11 years ago by Lester Caine

My previous post did not appear on the list ;)

Yasuo Ohgaki wrote:

> > A lot of the current confusion does seem to be based around the Windows
> > Wide-API as documented in 'The Problem' section of that document. It would
> > seem that my 'naive' view of simply using UTF-8 strings is thwarted by these
> > problems?
>
> Unicode is like one name with several encodings. We cannot get away from
> conversions, normalization especially.

That is why personally I'm just looking at UTF-8. Which is enough of a minefield
on its own, but since a large swath of what we are working with now is only
UTF-8 it does seem to be the right base going forward?

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

11 years ago by Yasuo Ohgaki

> My previous post did not appear on the list ;)
>
> Yasuo Ohgaki wrote:
>
> > > A lot of the current confusion does seem to be based around the
> > > Windows Wide-API as documented in 'The Problem' section of that
> > > document. It would seem that my 'naive' view of simply using UTF-8
> > > strings is thwarted by these problems?
> >
> > Unicode is like one name with several encodings. We cannot get away from
> > conversions, normalization especially.
>
> That is why personally I'm just looking at UTF-8. Which is enough of a
> minefield on its own, but since a large swath of what we are working with
> now is only UTF-8 it does seem to be the right base going forward?

I have that problem, too. It seems someone is working on DKIM.
Anyway, there are problems, but UTF-8 is the way to go. We just cannot remove
conversions. Normalization has a number of issues, including security-related
ones.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Jakub Zelenka

Hi Stas,

I just saw that the patch has been merged. There was an objection from
Yasuo about normalization in the past...

What's worse, the patch doesn't check UTF-8 correctly: it accepts invalid
UTF-8. The correct syntax is specified in
https://tools.ietf.org/html/rfc3629#section-4
I have a re2c implementation in the jsond scanner:
https://github.com/bukka/php-jsond/blob/master/jsond_scanner.re#L122-L134
As you can see, it's very different from the provided implementation.

I think that accepting ill-formed UTF-8 would be a mistake, and as such the
patch should be reverted.

Thanks

Jakub
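
For readers who want to see what the RFC 3629 section 4 rules amount to, here is a minimal hand-written C sketch of a checker that follows that ABNF. It is not the jsond scanner and not the code from either PR; `is_well_formed_utf8` and `in_range` are hypothetical names chosen for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Return true if s[0..len) is well-formed UTF-8 per the RFC 3629 ABNF:
 * no overlong forms, no surrogates (U+D800..U+DFFF), nothing above U+10FFFF. */
bool is_well_formed_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char lead = s[i];
        size_t tails;            /* number of bytes following the lead byte */
        unsigned char lo = 0x80; /* allowed range of the FIRST tail byte */
        unsigned char hi = 0xBF;

        if (lead <= 0x7F) {                       /* UTF8-1 */
            i++;
            continue;
        } else if (in_range(lead, 0xC2, 0xDF)) {  /* UTF8-2; C0/C1 are overlong */
            tails = 1;
        } else if (lead == 0xE0) {                /* UTF8-3, excludes overlong */
            tails = 2; lo = 0xA0;
        } else if (lead == 0xED) {                /* UTF8-3, excludes surrogates */
            tails = 2; hi = 0x9F;
        } else if (in_range(lead, 0xE1, 0xEF)) {  /* remaining UTF8-3 leads */
            tails = 2;
        } else if (lead == 0xF0) {                /* UTF8-4, excludes overlong */
            tails = 3; lo = 0x90;
        } else if (lead == 0xF4) {                /* UTF8-4, caps at U+10FFFF */
            tails = 3; hi = 0x8F;
        } else if (in_range(lead, 0xF1, 0xF3)) {  /* remaining UTF8-4 leads */
            tails = 3;
        } else {
            return false;                         /* 80..C1 and F5..FF */
        }

        if (len - i - 1 < tails) {
            return false;                         /* truncated sequence */
        }
        if (!in_range(s[i + 1], lo, hi)) {
            return false;
        }
        for (size_t k = 2; k <= tails; k++) {
            if (!in_range(s[i + k], 0x80, 0xBF)) {
                return false;
            }
        }
        i += tails + 1;
    }
    return true;
}
```

The restricted ranges on the first tail byte are what separate this from a purely structural check: they reject overlong forms such as `C0 AF`, surrogates such as `ED A0 80`, and values above U+10FFFF such as `F4 90 80 80`.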

11 years ago by Jakub Zelenka

I have created a quick PR, https://github.com/php/php-src/pull/649, that
fixes the handling of ill-formed UTF-8 paths.

Jakub

11 years ago by Stas Malyshev

Hi!

> I have created a quick PR: https://github.com/php/php-src/pull/649 that
> is fixing the ill-formed UTF-8 paths.

Thanks for the patch. One thing I'd like to understand is what is the
added value of being so strict in checking UTF-8. I.e. what would happen
if we allow some path with weird chars in?

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

11 years ago by Yasuo Ohgaki

Hi Stas,

On Tue, Apr 22, 2014 at 8:06 AM, Stas Malyshev smalyshev@sugarcrm.com wrote:

> > I have created a quick PR: https://github.com/php/php-src/pull/649 that
> > is fixing the ill-formed UTF-8 paths.
>
> Thanks for the patch. One thing I'd like to understand is what is the
> added value of being so strict in checking UTF-8. I.e. what would happen
> if we allow some path with weird chars in?

Although invalid encoding is not a security issue by itself, it creates
various uncertainties. There are (and were) many ways to exploit it, e.g.
old browsers had many security issues with ill-formed strings.
One valid example I can think of right now is filter evasion.

http://capec.mitre.org/data/definitions/80.html

Another is DoS: browsers may refuse to render a page at all when it contains
ill-formed strings, e.g. recent Chrome. Yet another is injection: if a user
assumes the path name encoding is valid UTF-8 and doesn't escape it, their
program could be vulnerable to injection.

Other programs are getting better at dealing with invalid encodings, but
leaving invalid encoding in place relies on other programmers' code for
proper and safe operation. This is not good. Any external input that has a
defined form must be validated wherever possible. This way, we do not
leave uncertainties or risks.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net
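
One way to picture the filter evasion point above is the classic overlong encoding of "/". The C sketch below is an assumption-laden illustration, not code from phar or from any real filter: `naive_filter_accepts` and `lax_decode_two_byte` are hypothetical. It shows a byte-level filter missing a slash that a decoder without an overlong check will later produce.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical filter: accepts a name only if it contains no literal '/'. */
bool naive_filter_accepts(const char *name)
{
    return strchr(name, '/') == NULL;
}

/* Hypothetical lax decoder: trusts the bit layout of a two-byte sequence
 * and does not reject overlong forms. */
unsigned lax_decode_two_byte(const unsigned char *s)
{
    return ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
}
```

The ill-formed sequence `C0 AF` contains no `0x2F` byte, so the filter lets it through, yet the lax decoder turns it into U+002F, i.e. "/". Rejecting ill-formed UTF-8 at the point where names enter the archive removes that disagreement between components.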

11 years ago by Yasuo Ohgaki

BTW, without NFC normalization, I'm sure there will be unhappy users among
those who use it with both OSX and Linux/Windows. OSX decomposes Unicode, so
there will be paths with the same name but different Unicode strings that
appear identical in a terminal etc. on Linux/Windows.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Lester Caine

Yasuo Ohgaki wrote:

> BTW, without NFC normalization, I sure there will be unhappy users if users
> use it with OSX and Linux/Windows. OSX decomposes Unicode and there will be
> the same name path with different unicode string that appears the same on
> their terminal/etc on Linux/Windows.

I don't think this problem is any different to the simple conflict between upper
and lower case 'normalizing' that happens currently? Each OS has its own
standards and quirks which we have to put up with. It is a simple fact that
UTF-8 does NOT mandate a preferred normalization form, and everything that is
valid has to be handled. This is back to the question of case-insensitive
comparisons, and whether even that can be supported going forward. If different
OSes 'normalise' a string for their own purposes, can we be expected to provide
different comparison rules for each? Or is it something that has to be passed
back up the chain for a library to handle more generically?

Phar should not 'translate' anything ... it is where these strings are used that
should handle any additional processing?

--
Lester Caine - G8HFL

11 years ago by Yasuo Ohgaki

Hi Lester,

> Yasuo Ohgaki wrote:
>
> > BTW, without NFC normalization, I sure there will be unhappy users if
> > users use it with OSX and Linux/Windows. OSX decomposes Unicode and
> > there will be the same name path with different unicode string that
> > appears the same on their terminal/etc on Linux/Windows.
>
> I don't think this problem is any different to the simple conflict between
> upper and lower case 'normalizing' that happens currently? Each OS has it's
> own standards and quirks which we have to put up with. It is a simple fact
> that UTF-8 does NOT have a preferred standard, and everything that is valid
> has to be handled. This is back to the question on case insensitive
> comparisons, and if even that can be supported going forward. If different
> OS's 'normalise' a string for their own purposes can we be expected to
> provide different comparison rules for each? Or is it something that has to
> be passed back up the chain for a library to handle more generically?
>
> Phar should not 'translate' anything ... it is where these strings are
> used that should handle any additional processing?

A phar could be extracted. Path name composition is mandatory for
compatibility between OSX and Linux/Windows, since OSX decomposes path
names intentionally. If you are curious, research how git works with OSX.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Lester Caine

> > Phar should not 'translate' anything ... it is where these strings
> > are used that should handle any additional processing?
>
> Phar could be extracted. Path name composition is mandatory for
> compatibility between OSX and Linux/Windows, since
> OSX decomposes path name intentionally. If you are curious, research how
> git works with OSX.

http://mercurial.selenic.com/wiki/SummerOfCode/Ideas2014#Unicode_filename_support_on_Windows
could be worth a look as well?

--
Lester Caine - G8HFL

11 years ago by Stas Malyshev

Hi!

> e.g. Old browsers had many security issues with ill-formed strings.
> One valid example I can think of right now is filter evasion.
> Another is DoS. Browsers may refuse to render page at all when there is
> ill-formed strings. e.g. Recent Chrome. Yet another is injections. i.e If
> user assumes path name encoding is UTF-8 and didn't escape, their program
> could be vulnerable to injections.

I'm sure these are serious issues, but I'm not sure - how do they relate
to the phar fix we're talking about?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

11 years ago by Lester Caine

Stas Malyshev wrote:

> I'm sure these are serious issues, but I'm not sure - how do they relate
> to the phar fix we're talking about?

There is no logical reason to allow invalid UTF-8 data to be stored, so step one
has to be to prevent its use ... at least then one does not have to discuss
just what happens to invalid data ... it's an error.

--
Lester Caine - G8HFL

11 years ago by Jakub Zelenka

On Tue, Apr 22, 2014 at 12:06 AM, Stas Malyshev smalyshev@sugarcrm.com wrote:

> Hi!
>
> > I have created a quick PR: https://github.com/php/php-src/pull/649 that
> > is fixing the ill-formed UTF-8 paths.
>
> Thanks for the patch. One thing I'd like to understand is what is the
> added value of being so strict in checking UTF-8. I.e. what would happen
> if we allow some path with weird chars in?

I think that validation is important to prevent user errors. The problem is
that the currently accepted implementation (MB2, MB3, MB4) only pretends to
validate UTF-8; the validation is incorrect. It doesn't allow arbitrary weird
characters (there is already a check for UTF-8 sequences), but it does allow
surrogate pair code points, which is wrong IMHO. The PR fixes that: it just
correctly checks for ill-formed characters.

In regards to normalization, I think Yasuo is right that there will be
unhappy users. On the other hand, I think there are more users who would
like to use UTF-8 paths at all. Normalization is a bit tricky, and the only
solution that comes to my mind ATM is a dependency on ICU, which wouldn't
be right IMHO.

Jakub
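
To illustrate how a check can look at UTF-8 sequence structure and still admit surrogate code points, here is a minimal C sketch. The merged MB2/MB3/MB4 rules are not shown in this thread, so `loose_utf8_check` is only a hypothetical structure-only check written for this illustration, not a copy of the phar code; `decode_three_byte` and `is_surrogate` are likewise made-up helper names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical structure-only check: the lead byte announces a length and
 * every following byte must look like 10xxxxxx. Code point values are
 * never inspected. */
bool loose_utf8_check(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t tails;

        if (s[i] < 0x80) {
            tails = 0;
        } else if ((s[i] & 0xE0) == 0xC0) {
            tails = 1;
        } else if ((s[i] & 0xF0) == 0xE0) {
            tails = 2;
        } else if ((s[i] & 0xF8) == 0xF0) {
            tails = 3;
        } else {
            return false;
        }
        if (len - i - 1 < tails) {
            return false; /* truncated sequence */
        }
        for (size_t k = 1; k <= tails; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return false;
            }
        }
        i += tails + 1;
    }
    return true;
}

/* Decode one three-byte sequence without validating the result. */
unsigned decode_three_byte(const unsigned char *s)
{
    return ((unsigned)(s[0] & 0x0F) << 12)
         | ((unsigned)(s[1] & 0x3F) << 6)
         |  (unsigned)(s[2] & 0x3F);
}

/* Surrogates U+D800..U+DFFF are reserved for UTF-16 and are not valid
 * in UTF-8. */
bool is_surrogate(unsigned code_point)
{
    return code_point >= 0xD800 && code_point <= 0xDFFF;
}
```

The bytes `ED A0 80` pass the structural check and decode to U+D800, a surrogate. A check along the lines of RFC 3629 section 4 avoids this by limiting the byte after `ED` to `80..9F`.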
