Newsgroups: php.internals
Message-ID: <2e5467ed-5298-fa28-ba81-c43d07de6641@cubiclesoft.com>
Date: Tue, 11 Oct 2022 08:53:15 -0700
To: Rowan Tommins , internals@lists.php.net
References: <0cfb9a7b-1168-42ef-ae1a-bdc72210de43@app.fastmail.com> <73b9c782-bcdf-7520-ea96-b2a265a933e2@gmail.com>
In-Reply-To: <73b9c782-bcdf-7520-ea96-b2a265a933e2@gmail.com>
Subject: Re: [PHP-DEV] Sanitize filters
From: thruska@cubiclesoft.com (Thomas Hruska)

On 10/6/2022 1:19 AM, Rowan Tommins wrote:
> On 05/10/2022 22:35, David Gebler wrote:
>> There are multiple RFC standards for email address format but AFAIK
>> PHP's FILTER_SANITIZE_EMAIL doesn't conform to any of them.
>
> FILTER_SANITIZE_EMAIL is a very short list of characters which claims to
> be based on RFC 822 section 6:
> https://heap.space/xref/php-src/ext/filter/sanitizing_filters.c?r=4df3dd76#295
>
> FILTER_VALIDATE_EMAIL doesn't say exactly which standard it's attempting
> to adhere to; it's one of many long unreadable regexes I've seen online
> claiming to cover all possible addresses. (Actually, there are now two
> regexes there, because there's a different version to support
> FILTER_FLAG_EMAIL_UNICODE).
> https://heap.space/xref/php-src/ext/filter/logical_filters.c?r=d8fc05c0#651
>
>> The idea behind my suggestion for something like is_valid_email
>> (whatever it might be named) is as a step towards deprecating and
>> removing the entire existing filter API, which I think many of us
>> agree is a mess.
>
> You described FILTER_VALIDATE_EMAIL as "notorious for being next to
> useless"; that gives us two possibilities:
>
> a) A new function will be just as useless, because it will be based on
> the same implementation
> b) There is a better implementation out there, which we should start
> using in ext/filter right now

For (b), well, there is always the option of handling email addresses the way the IETF intended instead of using regexes. For example, SMTP::MakeValidEmailAddress() from:

https://github.com/cubiclesoft/ultimate-email

does three things quite differently from ext/filter:

1) It uses a custom state engine to implement half of the relevant IETF ABNF grammars and then cheats for the other half.

The very complex specifications that the IETF (and W3C) produce should generally be implemented in software as custom state engines (finite state machines, or FSMs). A custom state engine can identify certain common input errors and transparently fix them in very specific instances as it processes the input (e.g. gmail,com -> gmail.com happens often). State engines can also accurately remove CFWS (comments and folding whitespace) from email addresses; CFWS is not a necessary component of an email address and causes all kinds of issues. State engines, when done right, can even outperform other functionally equivalent implementations. They can also read partial input and maintain their internal state while using few resources to process very large inputs (not particularly relevant in this case).

The current regex-based approach in ext/filter is obviously causing some problems that can probably be fixed by using a custom state engine.

Important caveat: Custom state engines do run the risk of winding up in an infinite loop if the author forgets to properly transition between states or to advance pointers through the input, resulting in DoS issues.
Been there, done that - both are very easy mistakes to make.

2) It parses email addresses in reverse: domain part first, local part second.

The ABNF grammars for the domain part are simpler and less contentious than the grammars for the local part. Also, IIRC, the domain part can't contain '@' while the local part can - it's been a while since I looked at the specs though.

3) It treats sanitization and validation as the same function.

There is no separate SMTP::IsValidEmailAddress() in the library because there is no need for one. If MakeValidEmailAddress() can't turn an input into a valid email address string, it returns an error. If the returned email address is not the same as the one that was input, the original address can be viewed as technically "invalid." One shared internal function for both FILTER_SANITIZE_EMAIL and FILTER_VALIDATE_EMAIL would produce consistent output/results.

Other thoughts:

I'm aware that a regex is effectively a state engine defined as a compact string. However, as evidenced by the two Perl CPAN regexes for email addresses currently in use, regexes are limited in utility and somewhat inflexible, get more difficult to read and comprehend once they grow longer than a few dozen bytes, and can't readily correct errors or other problems in complex input strings.

The ~250 lines of userland code referenced above are also not perfect (e.g. extracting characters using substr() is rather inefficient) but they work well enough. The userland code also performs a DNS MX record check by default, but that is its own complex can of worms and was probably not the best idea I've ever had. However, the three main concepts are the important takeaways here, not the referenced userland code.

> My gut feel is that (a) is true, and there is no point considering what
> a new function would be called, because we don't know how to implement it.

Perhaps the above will help to at least provide some new ideas to think about.
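
To make the three ideas above a bit more concrete, here is a minimal sketch, written in Python only for brevity. It is not the code from ultimate-email and not how ext/filter works. The names make_valid_email, is_valid_email, clean_domain and clean_local are hypothetical, the local part is handled by a simplified dot-atom check (the "cheat"), and quoted strings, CFWS, IP literals and internationalized domains are deliberately ignored.

```python
import string

# Hypothetical names throughout; this is an illustration, not a real API.
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")
LABEL_CHARS = set(string.ascii_letters + string.digits)

LABEL_START, IN_LABEL, AFTER_HYPHEN = range(3)


def clean_domain(domain):
    """Run a tiny state engine over the domain; return cleaned text or None."""
    state = LABEL_START
    out = []
    labels = 1
    # The for loop consumes exactly one character per step, so the engine
    # cannot stall on the input (the infinite-loop risk mentioned above).
    for ch in domain:
        if ch == ",":
            ch = "."  # repair the common "gmail,com" typo
        if ch in LABEL_CHARS:
            state = IN_LABEL
        elif ch == "-" and state != LABEL_START:
            state = AFTER_HYPHEN
        elif ch == "." and state == IN_LABEL:
            state = LABEL_START
            labels += 1
        else:
            return None  # no valid transition from the current state
        out.append(ch)
    # The input must end inside a label and contain at least one dot.
    if state != IN_LABEL or labels < 2:
        return None
    return "".join(out)


def clean_local(local):
    """Simplified dot-atom check; quoted strings and CFWS are not handled."""
    atoms = local.split(".")
    if any(not atom or not set(atom) <= ATEXT for atom in atoms):
        return None
    return local


def make_valid_email(address):
    """Return (True, cleaned_address) or (False, error_message)."""
    address = address.strip()
    # Domain first: split at the LAST '@'. A quoted local part may contain
    # '@' (not supported by this sketch); the domain may not.
    local, at, domain = address.rpartition("@")
    if not at:
        return (False, "missing '@'")
    domain = clean_domain(domain)
    if domain is None:
        return (False, "invalid domain")
    local = clean_local(local)
    if local is None:
        return (False, "invalid local part")
    return (True, local + "@" + domain)


def is_valid_email(address):
    """Validation is simply 'sanitizing succeeded and changed nothing'."""
    ok, result = make_valid_email(address)
    return ok and result == address
```

With this sketch, make_valid_email("user@gmail,com") returns (True, "user@gmail.com"), while is_valid_email("user@gmail,com") is False because the cleaned address differs from the input. Both answers come from the same code path, which is the point of idea 3.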
-- 
Thomas Hruska
CubicleSoft President

CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.

What software are you looking to build?