Newsgroups: php.internals
From: craig@craigfrancis.co.uk (Craig Francis)
Date: Thu, 7 Jan 2021 14:24:06 +0000
To: PHP internals
Cc: Claude Pache, Nikita Popov, Tomas Kuliavas
Subject: Re: [PHP-DEV] ENT_COMPAT for htmlentities and htmlspecialchars
Message-ID: <982075D8-2B6B-4157-86FD-502D65144830@craigfrancis.co.uk>
References: <99C71641-5A5B-49C8-8D96-F0C080352B91@gmail.com>

On Sat, Dec 26, 2020 at 12:03 PM Craig Francis wrote:
> Could htmlspecialchars() use ENT_QUOTES by default?
> [...]
> I'd also be tempted to suggest ENT_SUBSTITUTE should be included, as I prefer to keep as much of the valid data (rather than losing everything), but that's not as important as escaping the apostrophe by default.

On Thu, 7 Jan 2021 at 09:00, Claude Pache wrote:
> For ENT_SUBSTITUTE, there has been https://bugs.php.net/bug.php?id=69450, but I don’t understand the objection in that bug report.
Maybe there is some issue related to = non-Unicode multibyte encodings? On Thu, 7 Jan 2021 at 09:29, Tomas Kuliavas = wrote: > Only ISO-2022 encodings got bytes that can match symbols sanitized by = htmlspecialchars. >=20 > Bug objection insist that utf-8 parsing rules should be enacted by = sanitizing function and not by application which displays text. And PHP = code is enacting those rules in most unfriendly API way. Does anyone have an example where ENT_SUBSTITUTE could be used to create = an issue? ideally a security issue, but anything will do. With `htmlspecialchars($user_value)`, I don't think it would matter if = it ended with , like the example from Rasmus = (0xE0), because that end byte would be replaced by U+FFFD. With `htmlspecialchars($user_value . $system_value)`, if $user_value = ends with , it's possible some characters at the = beginning of $system_value could be replaced. But I can't find a way to = do that with UTF-8; and even if it was possible, I would have thought = some characters being replaced by U+FFFD, would be a much better = solution than everything being lost (noting that $system_value will not = contain any HTML characters, because they are escaped as well). echo '

' . htmlspecialchars($user . ' is lying to you') . '

'; =20 With: '

ABC=EF=BF=BDs lying to you

' Without: '

' And, in both cases, the output is valid UTF-8, and shouldn't affect = anything it's concatenated with (i.e. the HTML context). Personally, I think every part of our processes (input, processing, and = output) should do its best to handle encoding issues (incase something = is missed). I believe ENT_SUBSTITUTE is the best way to deal with it = during output. I don't think it's realistic to expect every single PHP = developer to check for invalid characters in every single bit of input. That said (and just to make things even more complicated), considering = this is HTML encoding, we could go even further and add ENT_DISALLOWED. = As Hans noted, some characters, such as 0x01, can be seen as valid in = general, but not valid for HTML (where Text Nodes "must not contain = control characters other than space characters"). All browsers seem to = handle these control characters (by ignoring them), so I'm not too = worried, but if it makes things safer, why not? Craig
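
For anyone who wants to reproduce the behaviour discussed above, here is a minimal sketch. The input value and variable names are illustrative assumptions (they are not taken from the bug report), and the flags are passed explicitly because the default flags of htmlspecialchars() differ between PHP versions.

```php
<?php
// Hypothetical input: "ABC" followed by a lone 0xE0 byte, which is the lead
// byte of a three-byte UTF-8 sequence with no continuation bytes after it.
$user = "ABC\xE0";

// Flags are spelled out so the result does not depend on version defaults.
$strict     = ENT_QUOTES | ENT_HTML401;
$substitute = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401;

// Without ENT_SUBSTITUTE (or ENT_IGNORE), invalid input yields an empty string.
$without = htmlspecialchars($user, $strict, 'UTF-8');

// With ENT_SUBSTITUTE, the invalid byte becomes U+FFFD and "ABC" is kept.
$with = htmlspecialchars($user, $substitute, 'UTF-8');

// The concatenation case: one invalid byte in the user value discards the
// system text as well, unless ENT_SUBSTITUTE is set.
$joinedWithout = htmlspecialchars($user . ' is lying to you', $strict, 'UTF-8');
$joinedWith    = htmlspecialchars($user . ' is lying to you', $substitute, 'UTF-8');

var_dump($without === '');                       // bool(true)
var_dump($with === "ABC\u{FFFD}");               // bool(true)
var_dump($joinedWithout === '');                 // bool(true)
var_dump(preg_match('//u', $joinedWith) === 1);  // bool(true): output is valid UTF-8
```

The last check uses preg_match() with the `u` modifier only as a quick UTF-8 validity test; mb_check_encoding() would do the same job where the mbstring extension is available.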