Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:102943
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain ohgaki.net designates 180.42.98.130 as permitted sender)
MIME-Version: 1.0
References: <CACXBjuh01Optb_LBXw899-+i+fM9iEwfOh4zjSHkOjhPJkoD_g@mail.gmail.com>
 <CAPv8svWyfqXB_yS8T24nL=ibxt3n7fBg0nC4GWN0h9agwDAxmw@mail.gmail.com>
 <CANUQDCic9DdUL-a7JjjaZq+QL3TH7=WOn9GmYyJTuYbWQdONCQ@mail.gmail.com>
 <3ce44a21a935f3d458bd4fea99db89a4fd2c9603.camel@ku.edu> <CACXBjujgbzX5AHJzO+6HrMTt0=yqXz0XOxZuVCThjrKaj2ntew@mail.gmail.com>
In-Reply-To: <CACXBjujgbzX5AHJzO+6HrMTt0=yqXz0XOxZuVCThjrKaj2ntew@mail.gmail.com>
Date: Sun, 22 Jul 2018 07:01:26 +0900
Message-ID: <CAGa2bXaoRL-J999+r7rn4LvGxZDJE8VS0yqqnMxZJ4P8zBy-gw@mail.gmail.com>
To: rasmus@lerdorf.com
Cc: zrhoffman@ku.edu, mapopa@gmail.com, me@kelunik.com, 
	internals@lists.php.net
Content-Type: multipart/alternative; boundary="000000000000ea6cdf0571898f70"
Subject: Re: [PHP-DEV] bugs.php.net downtime
From: yohgaki@ohgaki.net (Yasuo Ohgaki)

--000000000000ea6cdf0571898f70
Content-Type: text/plain; charset="UTF-8"

On Sat, Jul 21, 2018 at 10:14 AM Rasmus Lerdorf <rasmus@lerdorf.com> wrote:

> Other than the autoincrement they are identical. I normally use utf8mb4,
> but I figured I would play it safe and copy it over verbatim. I guess it
> wasn't safe.
>

Right. There are risks.

For example, encodings such as SJIS have multibyte characters that contain
\ (0x5C) as a valid trailing byte. When encodings are mixed, escaping can
be defeated and injections become possible.
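
A minimal sketch of that failure mode, in Python purely for illustration.
naive_escape() is a hypothetical byte-level escaper, not any real library
function; the assumption is that the escaped bytes are later interpreted as
Shift_JIS by another layer.

```python
def naive_escape(raw: bytes) -> bytes:
    """Hypothetical escaper that works on bytes and ignores the encoding."""
    return raw.replace(b"\\", b"\\\\").replace(b"'", b"\\'")


# 0x95 is a Shift_JIS lead byte; the attacker sends it followed by a quote.
attacker_input = b"\x95'"

# The escaper inserts a backslash: the bytes are now 0x95 0x5C 0x27.
escaped = naive_escape(attacker_input)

# A Shift_JIS-aware consumer reads 0x95 0x5C as ONE character, so the
# backslash disappears and the quote is left unescaped.
decoded = escaped.decode("shift_jis")

print(repr(decoded))
print("backslash survived:", "\\" in decoded)
```

The escaper did its job at the byte level, yet the quote reaches the
consumer without a preceding backslash.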

Even when only UTF-8 is used, inconsistent handling of invalid sequences
can break security measures, e.g. a multibyte UTF-8 sequence that is
missing its last byte.

When sanitization is required, programmers have two choices.
 - Remove as many bytes as the MSBs of the UTF-8 first byte declare, i.e.
   also consume the byte that follows the truncated sequence.
 - Remove only the bytes that are invalid as UTF-8, i.e. leave the
   following ASCII char in place, for example.

If these two designs are mixed, encoding attacks are possible as well.
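
The two designs can be sketched as follows, in Python for illustration
only. Both function names are hypothetical, and the first one is a
deliberately simplistic model of a "trust the lead byte" sanitizer.

```python
def declared_length(lead: int) -> int:
    """Sequence length announced by the high bits of a UTF-8 lead byte."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    return 1


def strip_by_lead_byte(raw: bytes) -> bytes:
    """Design 1: drop every byte the lead byte claims, valid or not."""
    out = bytearray()
    i = 0
    while i < len(raw):
        n = declared_length(raw[i])
        chunk = raw[i:i + n]
        try:
            chunk.decode("utf-8")
            out += chunk
        except UnicodeDecodeError:
            pass  # the whole declared span is discarded
        i += n
    return bytes(out)


def strip_invalid_only(raw: bytes) -> bytes:
    """Design 2: drop only the bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


# 0xE3 announces a 3-byte sequence, but only one continuation byte follows.
sample = b'a\xe3\x81"b'

print(strip_by_lead_byte(sample))   # the double quote is consumed
print(strip_invalid_only(sample))   # the double quote is kept
```

If a filter uses one design and the layer behind it uses the other, they
disagree about whether the quote exists, which is the opening for an
attack.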

DoS by invalid chars is trivial. Current web browsers can refuse to render
an entire page that has badly broken encoding.

The only good countermeasure against encoding attacks is encoding
validation with the Fail Fast principle, i.e. validate the encoding at the
application software's outermost trust boundary.
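
A minimal fail-fast sketch, in Python for illustration. The function name
require_valid_utf8() is hypothetical; the point is only that invalid input
is rejected at the boundary instead of being repaired deeper inside the
application.

```python
def require_valid_utf8(raw: bytes) -> str:
    """Decode input at the trust boundary, rejecting anything invalid."""
    try:
        return raw.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError as exc:
        # Fail fast: do not sanitize, do not guess, just refuse the input.
        raise ValueError(f"invalid UTF-8 at byte offset {exc.start}") from exc


print(require_valid_utf8("bug report".encode("utf-8")))

try:
    require_valid_utf8(b"truncated \xe3\x81")
except ValueError as err:
    print("rejected:", err)
```

Everything behind this check can then assume well-formed UTF-8.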

> I'll do some research, but ideas welcome.
>

IMO, all data should be converted to valid UTF-8, since UTF-8 is the
encoding bugs.php.net uses. Replace invalid data with "?" or something
else. Some data will be lost, but valid char encoding is mandatory for
correct data handling, as described above.

To replace invalid chars with "?" (or something else),
mb_convert_encoding() can be used. The "mbstring.substitute_character" INI
setting specifies the replacement char.
Default is none, so it removes invalid data by default.
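
The substitute-or-remove behaviour can be sketched in Python. This is an
analogue for illustration, not the mbstring API, and both function names
are hypothetical.

```python
def substitute_invalid(raw: bytes, substitute: str = "?") -> str:
    """Replace each invalid UTF-8 sequence with a substitute character."""
    # errors="replace" emits U+FFFD for invalid input; map that to "?".
    # Caveat: a genuine U+FFFD already present in the data is mapped too.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", substitute)


def remove_invalid(raw: bytes) -> str:
    """Silently drop invalid bytes, like a substitute char of "none"."""
    return raw.decode("utf-8", errors="ignore")


damaged = b"ok\xffok"
print(substitute_invalid(damaged))
print(remove_invalid(damaged))
```

Substituting a visible char leaves a trace of the damage, while removal
hides it.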

If you would like to keep the original data, the number of detected
invalid chars is recorded and can be retrieved from the array returned by
mb_get_info(). "illegal_chars" is the "Total number of illegal chars in
the script's lifetime". By checking this, the existence of invalid chars
can be detected.
(Alternatively, simply comparing the original and converted data works
also.)
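
Both detection approaches can be sketched in Python, for illustration
only. The counter below is a stand-in for the idea behind "illegal_chars",
not the mbstring implementation, and the names convert_and_count and
count_and_substitute are hypothetical.

```python
import codecs
from typing import Tuple


def convert_and_count(raw: bytes, substitute: str = "?") -> Tuple[str, int]:
    """Decode as UTF-8, substituting and counting invalid sequences."""
    count = 0

    def handler(err):
        nonlocal count
        count += 1                    # one hit per invalid sequence reported
        return (substitute, err.end)  # replacement text, resume position

    codecs.register_error("count_and_substitute", handler)
    text = raw.decode("utf-8", errors="count_and_substitute")
    return text, count


raw = b"ok\xffok"
text, illegal = convert_and_count(raw)
print(text, illegal)

# Alternative check: the round trip differs when anything was substituted.
print("changed:", text.encode("utf-8") != raw)
```

The counter tells how much was damaged; the comparison only tells that
something was.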

You might want to count the number of all illegal chars in the db before
converting the data. "illegal_chars" is handy for this.

If necessary, the old data can be kept alongside the converted data by
storing it as base64.
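
A minimal sketch of that idea in Python, for illustration. The record
layout and the key names "text" and "original_base64" are assumptions, not
the bugs.php.net schema.

```python
import base64


def convert_keeping_original(raw: bytes) -> dict:
    """Convert to valid UTF-8 and keep the raw bytes when data was altered."""
    text = raw.decode("utf-8", errors="replace").replace("\ufffd", "?")
    record = {"text": text}
    if text.encode("utf-8") != raw:
        # base64 is plain ASCII, so it is safe to store in a UTF-8 column.
        record["original_base64"] = base64.b64encode(raw).decode("ascii")
    return record


damaged = b"ok\xffok"
record = convert_keeping_original(damaged)
print(record["text"])
print(base64.b64decode(record["original_base64"]) == damaged)
```

Rows that were already valid carry no extra field, so only the damaged
rows grow.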

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

--000000000000ea6cdf0571898f70--