Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:108573
To: Nikita Popov <nikita.ppv@gmail.com>,
 Larry Garfield <larry@garfieldtech.com>
Cc: php internals <internals@lists.php.net>
References: <CAF+90c9TWvj9Podt3uCdf43SzwweLfjn1H3DbWnt+pSE5UKYSw@mail.gmail.com>
 <466bb718-4513-4a87-81e9-295ad3983443@www.fastmail.com>
 <CAF+90c_A3t6kD1bchD33OrLHAWL0JXOLQ4KJCBoxNyhfJ8dOPA@mail.gmail.com>
Message-ID: <21b96a22-464c-b308-766d-fcfba746e664@cubiclesoft.com>
Date: Fri, 14 Feb 2020 06:31:59 -0700
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:11.0) Gecko/20120327
 Thunderbird/11.0.1
MIME-Version: 1.0
In-Reply-To: <CAF+90c_A3t6kD1bchD33OrLHAWL0JXOLQ4KJCBoxNyhfJ8dOPA@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] [RFC] token_get_all() TOKEN_AS_OBJECT mode
From: thruska@cubiclesoft.com (Thomas Hruska)

On 2/14/2020 1:48 AM, Nikita Popov wrote:
> On Thu, Feb 13, 2020 at 6:06 PM Larry Garfield <larry@garfieldtech.com>
> wrote:
> 
>> On Thu, Feb 13, 2020, at 3:47 AM, Nikita Popov wrote:
>>> Hi internals,
>>>
>>> This has been discussed a while ago already, now as a proper proposal:
>>> https://wiki.php.net/rfc/token_as_object
>>>
>>> tl;dr is that it allows you to get token_get_all() output as an array of
>>> PhpToken objects. This reduces memory usage, improves performance, makes
>>> code more uniform and readable... What's not to like?
>>>
>>> An open question is whether (at least to start with) PhpToken should be
>>> just a data container, or whether we want to add some helper methods to
>> it.
>>> If this generates too much bikeshed, I'll drop methods from the proposal.
>>>
>>> Regards,
>>> Nikita
>>
>> I love everything about this.
>>
>> 1) I would agree with Nicolas that a static constructor would be better.
>> I don't know about polyfilling it, but it's definitely more
>> self-descriptive.
>>
>> 2) I'm skeptical about the methods.  I can see them being useful, but also
>> being bikeshed material.  For instance, if you're doing annotation parsing
>> then docblocks are not ignorable.  They're what you're actually looking for.
>>
>> Two possible additions, feel free to ignore if they're too complicated:
>>
>> 1) Should it return an array of token objects, or a lazy iterable?  If I'm
>> only interested in certain types (eg, doc strings, classes, etc.) then a
>> lazy iterable would allow me to string some filter and map operations on to
>> it and use even less memory overall, since the whole tree is not in memory
>> at once.
>>
> 
> I'm going to take you up on your offer and ignore this one :P Returning
> tokens as an iterator is inefficient because it requires full lexer state
> backups and restores for each token. Could be optimized, but I wouldn't
> bother with it for this feature. I also personally have no use-case for a
> lazy token stream. (It's technically sufficient for parsing, but if you
> want to preserve formatting, you're going to be preserving all the tokens
> anyway.)

Try passing a 10MB PHP file that's all code into token_get_all().  It's
pretty easy to hit hard memory limits and/or start crashing PHP when
token_get_all() tokenizes the whole thing into a giant array or set of
objects.  Calling gc_mem_caches() once the previous tokens aren't
needed anymore helps.  Stream-based token parsing would be better for
RAM usage, but I can see how it might be complex to implement and
largely not worth it: such scenarios will be rare, it would require the
ability to maintain lexer state externally, as you mentioned, and that
capability would only be used by this part of the software.
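
A rough way to see the effect described above, as a sketch only: the
helper name measure_tokenize() and the synthetic source are made up for
illustration, and the input is deliberately far smaller than 10MB so it
stays under a default memory_limit. The actual numbers will vary by PHP
version and build.

```php
<?php
// Hypothetical helper (not part of PHP or the RFC): tokenizes $source,
// then releases the tokens and asks the memory manager to free its caches.
function measure_tokenize(string $source): array
{
    $before = memory_get_usage();
    $tokens = token_get_all($source);          // whole file tokenized at once
    $count  = count($tokens);
    $held   = memory_get_usage() - $before;    // bytes held by the token array

    unset($tokens);                            // tokens no longer needed
    $reclaimed = gc_mem_caches();              // bytes freed from allocator caches

    return [
        'tokens'          => $count,
        'held_bytes'      => $held,
        'reclaimed_bytes' => $reclaimed,
    ];
}

// Synthetic all-code input; the repeat count is an arbitrary assumption.
$source = "<?php\n" . str_repeat("\$a = foo(\$b, 'c') + 1;\n", 5000);
$stats  = measure_tokenize($source);
print_r($stats);
```

Comparing held_bytes against strlen($source) gives a feel for how much
larger the token array is than the file it came from.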

-- 
Thomas Hruska
CubicleSoft President
