Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:115781
To: internals@lists.php.net
Message-ID: <3cd341d8-0572-fabf-4ec7-687195547e87@mobilejoomla.com>
Date: Tue, 24 Aug 2021 00:12:06 +0300
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.13.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
Subject: Proposal to a few fixes/improvements in the ini parser
From: denis@mobilejoomla.com (Denis Ryabov)

Hello internals,


I'd like to discuss some issues related to escaping of characters in the 
ini parser (the lexer to be precise).


1. Currently, double-quoted strings are processed twice: first in 
the <ST_DOUBLE_QUOTES>[^] lexer rule (to get the string length), and then in 
the zend_ini_escape_string function (to build the string by processing all 
escape sequences). The problem is that the two passes treat strings 
differently: the lexer rule uses a look-behind approach to check whether a 
double quote character is escaped, while zend_ini_escape_string skips 
escaped characters in the usual way (the skip-next-char approach, as in 
PHP's string parser). This leads to the following issue:

In some cases there is no way to escape a final backslash in a string, 
e.g. when the string is followed by anything other than a line break:

KEY1 = "prefix\\" ; Warning: syntax error, unexpected end of file, 
expecting TC_DOLLAR_CURLY or TC_QUOTED_STRING or '"'
KEY2 = "prefix\\" ACONST
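To make the mismatch concrete, here is a rough Python model of the two 
strategies. It is only a sketch of the behavior described above, not the 
actual re2c rule or zend_ini_escape_string; both function names are made up 
for illustration, and the extra line-break condition is left out.

```python
def end_by_look_behind(body: str) -> int:
    """Closing quote = first '"' whose preceding character is not a backslash.

    Simplified model of the look-behind check (assumption, not the real rule).
    Returns -1 when no closing quote is found.
    """
    for i, ch in enumerate(body):
        if ch == '"' and (i == 0 or body[i - 1] != "\\"):
            return i
    return -1


def end_by_skip_next(body: str) -> int:
    """Closing quote = first '"' reached when every backslash consumes the next char."""
    i = 0
    while i < len(body):
        if body[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if body[i] == '"':
            return i
        i += 1
    return -1


# Text after the opening quote of the KEY1 line.
body = r'prefix\\" ; comment'
print(end_by_look_behind(body))  # -1: the quote looks escaped, string never ends
print(end_by_skip_next(body))    # 8: the two backslashes pair up, quote closes
```

The look-behind model sees a backslash right before the quote and keeps 
scanning to the end of input, which is consistent with the "unexpected end 
of file" warning above.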

I'd switch to the PHP way and require each special character (", 
$, \) to be escaped in the usual (skip-next-char) way, without the 
look-behind approach. This may break backward compatibility for code that 
uses a sequence like \\" (instead of \\\") to get a backslash followed by a 
double quote, but I'm not sure that is widely used in the wild (moreover, 
this behavior is not explained in the PHP docs, so no one can rely on it).
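A minimal sketch of what the uniform rule could look like, written in Python 
rather than as lexer code. The name scan_quoted is hypothetical, ${...} 
handling is not modeled, and leaving unknown escapes (backslash plus a 
non-special character) untouched is an assumption, not something decided in 
this proposal.

```python
SPECIALS = '"$\\'  # the characters that must be escaped under the proposal


def scan_quoted(src: str):
    """Scan the text following an opening double quote.

    Returns (decoded value, index of the closing quote in src).
    Each backslash consumes exactly one following character.
    """
    out = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch == "\\" and i + 1 < len(src):
            nxt = src[i + 1]
            # Escaped special: drop the backslash. Anything else: keep both
            # characters as-is (assumption for this sketch).
            out.append(nxt if nxt in SPECIALS else ch + nxt)
            i += 2
            continue
        if ch == '"':
            return "".join(out), i
        out.append(ch)
        i += 1
    raise ValueError("unterminated double-quoted string")


# KEY2 from the example: the string now ends right after the escaped backslash.
print(scan_quoted(r'prefix\\" ACONST'))  # value 'prefix\', closing quote at 8

# The backward-incompatible case: backslash + quote must be written as \\\".
print(scan_quoted(r'a\\\"b"'))           # value 'a\"b', closing quote at 6
```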


2. In the <ST_DOUBLE_QUOTES>[^] lexer rule, the token is processed 
starting from the YYCURSOR position instead of yytext; as a result, the 
first character is not taken into account. In turn, this leaves no way to 
escape a leading dollar character followed by an opening curly brace:

KEY = "\${" ; Warning: syntax error, unexpected end of file, expecting 
TC_VARNAME
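The effect of starting one character too late can be shown with a small 
Python model. This is a simplified illustration of the description above, 
not the real lexer; first_interpolation and its start parameter are 
hypothetical names.

```python
def first_interpolation(content: str, start: int) -> int:
    """Index of the first unescaped "${" at or after `start`, or -1 if none.

    A backslash seen by the loop consumes the following character.
    """
    i = start
    while i < len(content):
        if content[i] == "\\":
            i += 2  # escaped character, cannot start "${"
            continue
        if content.startswith("${", i):
            return i
        i += 1
    return -1


content = r'\${'  # string body from the KEY example: backslash, dollar, brace

# Scanning from index 1 (the first character was already consumed) misses the
# leading backslash, so "${" is taken as the start of a variable.
print(first_interpolation(content, 1))  # 1

# Scanning from index 0 sees the backslash and treats "$" as escaped.
print(first_interpolation(content, 0))  # -1
```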


3. I'd also like to note that the ini parser currently doesn't support 
standard escape sequences (\n, \t, etc.), although the official PHP docs 
(https://www.php.net/manual/en/function.parse-ini-file.php) may lead one 
to expect that they are supported:

; \ is used to escape a value.
newline_is = "\\n" ; results in the string "\n", not a newline character.


It seems easy to fix/implement the things mentioned above (I'll send a 
PR if there are no objections).

So, how would you rate this idea on the following scale (1-5)?

1) It's not necessary at all, let's keep the current ini lexer as is.
2) Let's require escaping of special characters (", \, $) only in a 
uniform (skip-next-char) way.
3) The above, plus support for the \t, \n, \v, \f, \r, \e sequences.
4) The above, plus support for \123 (octal) and \xAB (hex) character codes.
5) The above, plus support for \u{12AB} (Unicode hex code points); actually 
I'd rather not implement this, because I don't know how to deal with 
partial content like
KEY = "\u{"
(PHP stops with "Parse error: Invalid UTF-8 codepoint escape sequence", 
but I'm not sure the ini parser should follow this rule).
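For reference, options 2 to 4 could be sketched roughly as follows in 
Python. This is only an illustration of the intended decoding, not the 
planned C implementation; decode, SIMPLE and ESCAPE are made-up names. The 
digit limits (1-3 octal, 1-2 hex) and the octal wrap to one byte follow 
PHP's double-quoted strings, and leaving unknown escapes untouched is an 
assumption. \u{...} is deliberately not handled.

```python
import re

# Option 2 (special characters) and option 3 (simple escapes).
SIMPLE = {
    '"': '"', "$": "$", "\\": "\\",
    "t": "\t", "n": "\n", "v": "\v", "f": "\f", "r": "\r", "e": "\x1b",
}

# Option 4: \123 (octal) and \xAB (hex); otherwise any single character.
ESCAPE = re.compile(r'\\(?:([0-7]{1,3})|x([0-9A-Fa-f]{1,2})|(.))', re.DOTALL)


def decode(body: str) -> str:
    """Decode escape sequences in the body of a double-quoted ini string."""
    def repl(m):
        octal, hexa, ch = m.groups()
        if octal is not None:
            return chr(int(octal, 8) & 0xFF)  # wrap to one byte, as PHP does
        if hexa is not None:
            return chr(int(hexa, 16))
        # Unknown escape: keep backslash and character (assumption).
        return SIMPLE.get(ch, "\\" + ch)
    return ESCAPE.sub(repl, body)


print(repr(decode(r'a\tb')))      # 'a\tb' (a real tab)
print(repr(decode(r'\101\x42')))  # 'AB'
print(repr(decode(r'\\n')))       # '\\n': a backslash and "n", not a newline
```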

Any comments are welcome.


Best regards,
Denis Ryabov