Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:118796
Message-ID: <2e5467ed-5298-fa28-ba81-c43d07de6641@cubiclesoft.com>
Date: Tue, 11 Oct 2022 08:53:15 -0700
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:11.0) Gecko/20120327
 Thunderbird/11.0.1
Content-Language: en-US
To: Rowan Tommins <rowan.collins@gmail.com>, internals@lists.php.net
References: <CAGBsUrdri=hXQHy5JnVhpiz51FbnqVBHjrBg9-tj8m=vzfZtzA@mail.gmail.com>
 <0cfb9a7b-1168-42ef-ae1a-bdc72210de43@app.fastmail.com>
 <CA+p1FaaDimmXXxfmGYogENvTpKp2qvyYNt4MyJT9HrehiTbXGg@mail.gmail.com>
 <CAGgaK7LMx+hd_V_EPNnmRb8pggAmP62vVtLjh=xgCFCkCQ0dgw@mail.gmail.com>
 <CA+p1FaapMaiWzz438Jr7N6B5qg0FecUuzzoe5oce-PyggaSzXw@mail.gmail.com>
 <CANYPt=WE+iTbECh3Kzh+jqkkqz2+6ezG6PGpSKypJm0yGQhMCw@mail.gmail.com>
 <ef9896b7-eb66-b944-a2e8-ea0adb44abf2@gmail.com>
 <CA+p1FaaeKQGFPVG2C_nYFp5Z07WupruMidmwaPjjJ=XG5rZWgQ@mail.gmail.com>
 <73b9c782-bcdf-7520-ea96-b2a265a933e2@gmail.com>
In-Reply-To: <73b9c782-bcdf-7520-ea96-b2a265a933e2@gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] Sanitize filters
From: thruska@cubiclesoft.com (Thomas Hruska)

On 10/6/2022 1:19 AM, Rowan Tommins wrote:
> On 05/10/2022 22:35, David Gebler wrote:
>> There are multiple RFC standards for email address format but AFAIK 
>> PHP's FILTER_SANITIZE_EMAIL doesn't conform to any of them.
> 
> FILTER_SANITIZE_EMAIL is a very short list of characters which claims to 
> be based on RFC 822 section 6: 
> https://heap.space/xref/php-src/ext/filter/sanitizing_filters.c?r=4df3dd76#295 
> 
> FILTER_VALIDATE_EMAIL doesn't say exactly which standard it's attempting 
> to adhere to; it's one of many long unreadable regexes I've seen online 
> claiming to cover all possible addresses. (Actually, there are now two 
> regexes there, because there's a different version to support 
> FILTER_FLAG_EMAIL_UNICODE). 
> https://heap.space/xref/php-src/ext/filter/logical_filters.c?r=d8fc05c0#651
> 
>> The idea behind my suggestion for something like is_valid_email 
>> (whatever it might be named) is as a step towards deprecating and 
>> removing the entire existing filter API, which I think many of us 
>> agree is a mess.
> 
> You described FILTER_VALIDATE_EMAIL as "notorious for being next to 
> useless"; that gives us two possibilities:
> 
> a) A new function will be just as useless, because it will be based on 
> the same implementation


> b) There is a better implementation out there, which we should start 
> using in ext/filter right now

For (b), well, there is always the option of handling email addresses 
the way the IETF intended instead of using regexes.

For example, SMTP::MakeValidEmailAddress() from:

https://github.com/cubiclesoft/ultimate-email

does three things quite differently from ext/filter:

1)  It uses a custom state engine to implement half of the relevant IETF 
ABNF grammars and then cheats for the other half.  Complex 
specifications like those the IETF (and W3C) produce are generally best 
implemented as custom state engines (finite state machines or FSMs) in 
software.  A custom state engine can identify certain common input 
errors and transparently fix them in very specific instances as it 
processes the input (e.g. gmail,com -> gmail.com happens often).  State 
engines can also accurately remove CFWS (comments and folding 
whitespace) from email addresses; CFWS is not a necessary component of 
an email address and causes all kinds of issues.  State engines, when 
done right, can even outperform other functionally equivalent 
implementations.  They can also read partial input and maintain their 
internal state while using few resources to process very large inputs 
(not particularly relevant in this case).  The current regex-based 
approach in ext/filter is obviously causing some problems that can 
probably be fixed by using a custom state engine.

Important caveat:  Custom state engines run the risk of winding up in an 
infinite loop if the code forgets to transition between states or to 
advance its pointer through the input, resulting in DoS issues.  Been 
there, done that - both are very easy mistakes to make.

2)  It parses email addresses in reverse:  Domain part first, local part 
second.  The ABNF grammars for the domain part are simpler and less 
contentious than the grammars for the local part.  Also, IIRC, the 
domain part can't contain '@' while the local part can - it's been a 
while since I looked at the specs though.

3)  It treats sanitization and validation as the same function.  There 
is no separate SMTP::IsValidEmailAddress() in the library because there 
is no need for one.  If MakeValidEmailAddress() can't turn an input 
into a valid email address string, it returns an error.  If the 
returned email address is not the same as the one that was input, the 
original address can be viewed as technically "invalid."  One shared 
internal function for both FILTER_SANITIZE_EMAIL and 
FILTER_VALIDATE_EMAIL would produce consistent output/results.
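
To make the three ideas above concrete, here is a minimal sketch.  It is 
not the code from ultimate-email or ext/filter:  it is written in Python 
only to keep it short, the names strip_cfws(), make_valid_email() and 
is_valid_email() are made up for illustration, and the grammar is 
heavily simplified (dot-atom local parts and plain hostnames only; no 
quoted local parts, address literals, or Unicode).

```python
import re

# Hypothetical, simplified patterns; not taken from any RFC verbatim.
ATEXT = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+")
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def strip_cfws(text):
    """Drop comments and whitespace outside quoted strings (idea 1)."""
    out = []
    state = "text"      # one of "text", "quoted", "comment"
    depth = 0           # nesting depth of (comments)
    escaped = False     # previous character was a backslash
    # The for loop consumes exactly one character per step, so the engine
    # cannot stall on a forgotten pointer advance.
    for ch in text:
        if state == "comment":
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    state = "text"
        elif state == "quoted":
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                state = "text"
        elif ch == "(":
            state, depth = "comment", 1
        elif ch == '"':
            state = "quoted"
            out.append(ch)
        elif ch not in " \t\r\n":
            out.append(ch)
    if state != "text":
        return None     # unterminated comment or quoted string
    return "".join(out)


def make_valid_email(raw):
    """Return (True, address) or (False, reason)."""
    cleaned = strip_cfws(raw)
    if cleaned is None:
        return False, "unterminated comment or quoted string"
    # Idea 2: domain first.  Split at the LAST "@", because a quoted
    # local part may itself contain "@".
    local, at, domain = cleaned.rpartition("@")
    if not at or not local or not domain:
        return False, "missing local part, '@', or domain"
    domain = domain.replace(",", ".")   # common typo: gmail,com
    labels = domain.split(".")
    if len(labels) < 2 or not all(LABEL.fullmatch(l) for l in labels):
        return False, "invalid domain"
    # Sketch limitation: only dot-atom local parts are accepted.
    if not all(ATEXT.fullmatch(a) for a in local.split(".")):
        return False, "invalid local part"
    return True, local + "@" + domain


def is_valid_email(raw):
    """Idea 3: valid means the sanitizer returns the input unchanged."""
    ok, result = make_valid_email(raw)
    return ok and result == raw
```

With this sketch, make_valid_email("user@gmail,com") yields 
(True, "user@gmail.com"), while is_valid_email() reports the same input 
as invalid because the output differs from the input.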


Other thoughts:  I'm aware that a regex is effectively a state engine 
defined as a compact string.  However, as evidenced by the two Perl CPAN 
regexes for email addresses currently in use, regexes are limited in 
utility/function, are somewhat inflexible, get harder to read and 
comprehend once they grow beyond a few dozen bytes, and can't readily 
correct errors or other problems in complex input strings.  The ~250 
lines of userland code referenced above are also not perfect (e.g. 
extracting characters using substr() is rather inefficient), but they 
work well enough.  The userland code also performs a DNS MX record 
check by default, but that is its own complex can of worms and was 
probably not the best idea I've ever had.  However, the three main 
concepts are the important takeaways here, not the referenced userland 
code.


> My gut feel is that (a) is true, and there is no point considering what 
> a new function would be called, because we don't know how to implement it.

Perhaps the above will at least provide some new ideas to ponder.

-- 
Thomas Hruska
CubicleSoft President
