Proposal for a few fixes/improvements in the ini parser

3 years ago by Denis Ryabov

Hello internals,

I'd like to discuss some issues related to escaping of characters in the
ini parser (the lexer to be precise).

Currently, double-quoted strings are processed twice: first in the
<ST_DOUBLE_QUOTES>[^] lexer rule (to get the string length), and then in
the zend_ini_escape_string function (to build the string by processing
all escape sequences). The problem is that the two passes treat the
string differently: the lexer rule uses a look-behind approach to check
whether a double quote character is escaped, while
zend_ini_escape_string skips escaped characters in the usual way (the
skip-next-char approach, as in PHP's string parser). As a result, there
are the following issues:

In some cases there is no way to escape a final backslash in a string,
e.g. when the string is followed by anything other than a linebreak:

KEY1 = "prefix\\" ; Warning: syntax error, unexpected end of file,
expecting TC_DOLLAR_CURLY or TC_QUOTED_STRING or '"'
KEY2 = "prefix\\" ACONST
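
To make the mismatch concrete, here is a rough Python model of the two
strategies. It is only a sketch, not the actual lexer code: the function
names are made up for illustration, and the linebreak special case is
left out. The input is a value that ends in an escaped backslash (two
backslashes before the closing quote).

```python
def end_by_look_behind(text: str, start: int) -> int:
    """Index of the closing quote, or -1 if the string never ends.

    Simplified model of the look-behind check: a quote counts as escaped
    whenever the character right before it is a backslash.
    """
    i = start
    while i < len(text):
        if text[i] == '"' and not (i > start and text[i - 1] == "\\"):
            return i
        i += 1
    return -1


def end_by_skip_next(text: str, start: int) -> int:
    """Index of the closing quote, or -1 if the string never ends.

    Skip-next-char model: a backslash always consumes the character
    that follows it, whatever that character is.
    """
    i = start
    while i < len(text):
        if text[i] == "\\":
            i += 2
            continue
        if text[i] == '"':
            return i
        i += 1
    return -1


line = r'"prefix\\" ACONST'          # two backslashes, then the closing quote
print(end_by_look_behind(line, 1))   # look-behind never finds the end
print(end_by_skip_next(line, 1))     # skip-next-char stops at the real quote
```

In this model the look-behind scan runs off the end of the input, which
is consistent with the "unexpected end of file" warning, while the
skip-next-char scan ends the string where the author intended.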

I'd switch to the PHP way and require each special character (", $, \)
to be escaped in the usual (skip-next-char) way, without the look-behind
approach. This may break backward compatibility for code that uses a
sequence like \\" (instead of \\\") to get a backslash followed by a
double quote, but I doubt it is widely used in the wild (moreover, this
behavior is not described in the PHP docs, so no one can rely on it).

In the <ST_DOUBLE_QUOTES>[^] lexer rule, the token is processed
starting from the YYCURSOR position instead of yytext, so the first
character is not taken into account. In turn, this leaves no way to
escape a leading dollar character followed by an opening curly brace:

KEY = "\${" ; Warning: syntax error, unexpected end of file, expecting
TC_VARNAME
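
The effect of starting one character too late can be modelled in a few
lines of Python. This is a sketch under assumptions, not the real rule:
`literal_length` is a hypothetical helper, `body` stands for the text
right after the opening quote, `start=0` models scanning from yytext and
`start=1` models scanning from YYCURSOR after one character has already
been matched.

```python
def literal_length(body: str, start: int) -> int:
    """Length of the literal run at the beginning of `body`.

    The run ends at an unescaped double quote or at an unescaped "${".
    Characters before `start` are accepted without being examined.
    """
    i = start
    while i < len(body):
        if body[i] == "\\":
            i += 2                      # backslash consumes the next char
            continue
        if body[i] == '"' or body.startswith("${", i):
            break
        i += 1
    return min(i, len(body))


body = r'\${"'                          # backslash, dollar, brace, closing quote
print(literal_length(body, 1))          # the backslash is never seen as an escape
print(literal_length(body, 0))          # the backslash protects the dollar sign
```

When the scan starts at index 1, the leading backslash is skipped
unexamined, so the following "${" is taken as the start of a variable
reference, which matches the "expecting TC_VARNAME" warning.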

Also, I'd like to note that the ini parser currently doesn't support
standard escape sequences (\n, \t, etc.), though from the official PHP
docs (https://www.php.net/manual/en/function.parse-ini-file.php) one
might expect them to be supported:

; \ is used to escape a value.
newline_is = "\n" ; results in the string "\n", not a newline character.

It seems easy to fix/implement the above-mentioned things (I'll send a
PR if there is no disagreement).

So, how would you rate this idea on the following scale (1-5)?

1. It's not necessary at all, let's keep the current ini lexer as is.
2. Let's require escaping of special characters (", \, $) only in a
   uniform (skip-next-char) way.
3. The above, plus support for the \t, \n, \v, \f, \r, \e sequences.
4. The above, plus support for \123 (octal) and \xAB (hex) charcodes.
5. The above, plus support for \u{12AB} (Unicode hex codepoints);
   actually I'd rather not implement it because I don't know how to
   deal with partial contents like
   KEY = "\u{"
   (PHP stops with "Parse error: Invalid UTF-8 codepoint escape
   sequence", but I'm not sure the ini parser should follow this rule).

Any comments are welcome.

Best regards,
Denis Ryabov