Newsgroups: php.internals
Date: Sat, 1 Oct 2022 16:39:35 +0100
From: tekiela246@gmail.com (Kamil Tekiela)
To: PHP internals
Subject: Sanitize filters

Hi Internals,

For quite some time now, PHP's sanitize filters have "Rustled My Jimmies". These filters bother me because I can't really justify their existence. I can understand that a few of them are sensible and may come in handy, but I would like to talk about some of these in particular.

In PHP 8.1, we have deprecated FILTER_SANITIZE_STRING which I deemed to be a priority due to its confusing name and behaviour. The rest is slightly less dangerous, but as was pointed out to me in a recent conversation with a PHP developer, these filters are all very confusing.

I would like to have some opinions on the following filters. What do you think we should do with them? Deprecate? Fix? Provide better documentation?
---

*FILTER_SANITIZE_ENCODED* - "URL-encode string, optionally strip or encode special characters."
Now, what does that mean? PHP has two functions for URL encoding: urlencode, used for encoding query-string parts, and rawurlencode, used for encoding any other URL part (the two functions follow different RFCs). Which of these RFCs does this filter apply? Furthermore, the description says that "special characters" can be stripped or encoded. Is one of these actions the default, with the other selected by a flag, or are both optional? What are these special characters? Are they special in the context of a URL? If so, why did we encode them first? If these are HTML special characters (there's no single definition of special HTML chars), then why does this filter encode them if the filter is for URL sanitization? What does the backtick have to do with any of this (FILTER_FLAG_STRIP_BACKTICK)?

*FILTER_SANITIZE_ADD_SLASHES* - "Apply addslashes(). (Available as of PHP 7.3.0)"
This filter was added as a replacement for the magic_quotes filter. According to the PHP documentation, addslashes is supposed to be used when injecting PHP variables into an eval'd string. Real-life usage has shown that this function is used in a lot of places that have nothing to do with PHP's eval. I am not sure if the sanitize filter is misused in a similar fashion, but judging from the fact that it was meant as a replacement for magic_quotes, my guess is that it's very likely still abused.

*FILTER_SANITIZE_EMAIL* - "Remove all characters except letters, digits and !#$%&'*+-=?^_`{|}~@.[]."
Which RFC does this adhere to? It strips slashes and quoted parts, doesn't allow IPv6 addresses and doesn't accept RFC 6530 email addresses. This filter is ok for simple usage, but it isn't true to any known specification AFAIK.

*FILTER_SANITIZE_SPECIAL_CHARS* - "HTML-encode '"<>& and characters with ASCII value less than 32, optionally strip or encode other special characters."
What's the intended purpose of this filter? "Special characters" are still not clearly defined, but at least the description is clearer than the one for FILTER_SANITIZE_ENCODED. Same question about backticks though: why? Why encode ASCII <32 chars?

*FILTER_SANITIZE_FULL_SPECIAL_CHARS* - "Equivalent to calling htmlspecialchars() with ENT_QUOTES set. Encoding quotes can be disabled by setting FILTER_FLAG_NO_ENCODE_QUOTES. Like htmlspecialchars(), this filter is aware of the default_charset and if a sequence of bytes is detected that makes up an invalid character in the current character set then the entire string is rejected resulting in a 0-length string. When using this filter as a default filter, see the warning below about setting the default flags to 0."
Not to be mistaken with FILTER_SANITIZE_SPECIAL_CHARS. As long as it's not used with filter_input(), it's the least problematic. We have htmlspecialchars() though, so how useful is this filter?

*FILTER_UNSAFE_RAW* - What makes it unsafe? Why isn't this just called FILTER_RAW_STRING? If the value being filtered is something other than a string, what will this filter return? Integers, floats, booleans and nulls are converted to a string; arrays and objects make the filter fail.

---

Let's quickly mention the filter flags.

The FILTER_FLAG_STRIP_LOW flag will also remove tabs, carriage returns and newlines, as these all have ASCII codes below 32. When is this useful and expected?

The FILTER_FLAG_ENCODE_LOW flag "encodes" ASCII <32 codes, presumably into HTML entities, although that's not specified anywhere in the PHP manual. The word HTML does not appear on the https://www.php.net/manual/en/filter.filters.flags.php page. What do these characters look like when presented by HTML? When is it ever useful to use this flag?

FILTER_FLAG_ENCODE_AMP & FILTER_FLAG_STRIP_BACKTICK - why are these even a thing?
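To make the flag behaviour described above concrete, here is a minimal sketch. The input string and variable names are made up for illustration, and FILTER_UNSAFE_RAW is used only because it applies no transformation of its own, so each result isolates a single flag. The values in the comments reflect how the filter extension behaves on PHP 8.x as far as can be told from running it, not anything the manual promises.

```php
<?php
// Hypothetical input: a tab, a CR/LF pair, an ampersand and two backticks.
$input = "a\tb\r\nc & `d`";

// Strips every byte below ASCII 32, so tab, CR and LF all disappear.
$stripLow = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);
// "abc & `d`"

// Replaces bytes below ASCII 32 with numeric HTML entities.
$encodeLow = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW);
// "a&#9;b&#13;&#10;c & `d`"

// Replaces "&" with a numeric HTML entity; control characters are untouched.
$encodeAmp = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_AMP);
// "a\tb\r\nc &#38; `d`"

// Removes backticks and nothing else.
$stripBacktick = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_BACKTICK);
// "a\tb\r\nc & d"

var_dump($stripLow, $encodeLow, $encodeAmp, $stripBacktick);
```

Note that FILTER_FLAG_STRIP_LOW silently joins "a", "b" and "c" into one word, which is exactly the kind of surprise in question for multi-line or tab-separated input.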
Due to flags, FILTER_VALIDATE_EMAIL will happily validate email addresses that would otherwise be mangled by FILTER_SANITIZE_EMAIL.

These are just the things I found confusing and strange about the sanitize filters. Let's try to put ourselves in the shoes of an average PHP developer trying to comprehend these filters. It's quite easy to shoot yourself in the foot if you try to use them. The PHP manual doesn't do a good job of explaining them, but that's probably because they are not easy to explain. I can't come up with good examples of when they should be used.

Regards,
Kamil