Newsgroups: php.internals
Date: Sun, 22 Jul 2018 07:01:26 +0900
References: <3ce44a21a935f3d458bd4fea99db89a4fd2c9603.camel@ku.edu>
To: rasmus@lerdorf.com
Cc: zrhoffman@ku.edu, mapopa@gmail.com, me@kelunik.com, internals@lists.php.net
Subject: Re:
[PHP-DEV] bugs.php.net downtime
From: yohgaki@ohgaki.net (Yasuo Ohgaki)

On Sat, Jul 21, 2018 at 10:14 AM Rasmus Lerdorf wrote:

> Other than the autoincrement they are identical. I normally use utf8mb4,
> but I figured I would play it safe and copy it over verbatim. I guess it
> wasn't safe.

Right. There are risks. For example, an encoding like SJIS can contain \ (the 0x5C byte) as part of a valid multibyte char. When encodings are mixed, escaping can be defeated and injections become possible.

Even when UTF-8 is used, inconsistent handling of invalid encoding can break security measures, e.g. an invalid UTF-8 sequence that is missing the last byte of a multibyte char. When sanitization is required, programmers have two choices:

- Remove as many bytes as the MSBs of the UTF-8 first byte specify, i.e. consume the byte in the last position too.
- Remove only the bytes that are invalid as UTF-8, i.e. leave a trailing ASCII char in place, for example.

If these two designs are mixed, an encoding attack is also possible.

DoS by invalid chars is trivial. Current web browsers can refuse to render an entire page that has badly broken encoding.

The only good countermeasure against encoding attacks is encoding validation following the Fail Fast principle, i.e. validate the encoding at the application's outermost trust boundary.

> I'll do some research, but ideas welcome.

IMO, all data should be converted to valid UTF-8, since UTF-8 is the encoding bugs.php.net uses. Replace invalid data with "?" or something else. Some data will be lost; however, valid char encoding is mandatory for correct data handling, as described above.

To replace invalid chars with "?" (or something else), mb_convert_encoding() can be used. The "mbstring.substitute_character" INI setting specifies the replacement char. Default is none, so it removes invalid data by default.

If you would like to keep the original data, the number of invalid chars detected is recorded and can be retrieved from the array returned by mb_get_info().
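A minimal sketch of the two steps suggested above (fail-fast validation for new input, and replacing invalid bytes in old data) might look like the following. The function names require_valid_utf8() and scrub_to_utf8() are made up for illustration and are not part of the bugs.php.net code. The substitute character is set explicitly at runtime with mb_substitute_character() so the result does not depend on the INI default. How many "?" a given broken sequence produces may differ between PHP versions, so the sketch only reports whether the data changed.

```php
<?php
// Hypothetical helpers for illustration only; not bugs.php.net code.

/**
 * Fail fast: reject input that is not valid UTF-8 at the trust boundary.
 */
function require_valid_utf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

/**
 * Repair legacy data: replace invalid byte sequences with "?".
 * Returns [cleaned string, whether anything was changed].
 */
function scrub_to_utf8(string $raw): array
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character(0x3F);          // 0x3F is "?"
    $clean = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the previous setting
    return [$clean, $clean !== $raw];
}

// Example: a UTF-8 sequence that is missing its last byte.
$broken = "abc\xE3\x81";
[$clean, $changed] = scrub_to_utf8($broken);
var_dump($changed, mb_check_encoding($clean, 'UTF-8'));
```

Comparing the original and converted strings, as done here, is the simple way to detect that invalid data was present; rows flagged this way could be kept aside before the converted value is written back.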
"illegal_chars" is "Total number of illegal chars in the script's lifetime". By checking this, invalid char existence can be checked. (Alternatively, simply comparing original and converted data works also.) You might want to count number of all illegal chars in the db before converting data. "illegal_chars" is handy for this. Old data may be added to converted data by using base64 if it's necessary. Regards, -- Yasuo Ohgaki yohgaki@ohgaki.net --000000000000ea6cdf0571898f70--