From: thruska@cubiclesoft.com (Thomas Hruska)
To: Nikita Popov, Larry Garfield
Cc: php internals
Date: Fri, 14 Feb 2020 06:31:59 -0700
Subject: Re: [PHP-DEV] [RFC] token_get_all() TOKEN_AS_OBJECT mode

On 2/14/2020 1:48 AM, Nikita Popov wrote:
> On Thu, Feb 13, 2020 at 6:06 PM Larry Garfield wrote:
>
>> On Thu, Feb 13, 2020, at 3:47 AM, Nikita Popov wrote:
>>> Hi internals,
>>>
>>> This has been discussed a while ago already, now as a proper proposal:
>>> https://wiki.php.net/rfc/token_as_object
>>>
>>> tl;dr is that it allows you to get token_get_all() output as an array of
>>> PhpToken objects. This reduces memory usage, improves performance, makes
>>> code more uniform and readable... What's not to like?
>>>
>>> An open question is whether (at least to start with) PhpToken should be
>>> just a data container, or whether we want to add some helper methods to it.
>>> If this generates too much bikeshed, I'll drop methods from the proposal.
>>>
>>> Regards,
>>> Nikita
>>
>> I love everything about this.
>>
>> 1) I would agree with Nicolas that a static constructor would be better.
>> I don't know about polyfilling it, but it's definitely more
>> self-descriptive.
>>
>> 2) I'm skeptical about the methods. I can see them being useful, but also
>> being bikeshed material. For instance, if you're doing annotation parsing
>> then docblocks are not ignorable. They're what you're actually looking for.
>>
>> Two possible additions, feel free to ignore if they're too complicated:
>>
>> 1) Should it return an array of token objects, or a lazy iterable? If I'm
>> only interested in certain types (eg, doc strings, classes, etc.) then a
>> lazy iterable would allow me to string some filter and map operations on to
>> it and use even less memory overall, since the whole tree is not in memory
>> at once.
>>
>
> I'm going to take you up on your offer and ignore this one :P Returning
> tokens as an iterator is inefficient because it requires full lexer state
> backups and restores for each token. Could be optimized, but I wouldn't
> bother with it for this feature. I also personally have no use-case for a
> lazy token stream. (It's technically sufficient for parsing, but if you
> want to preserve formatting, you're going to be preserving all the tokens
> anyway.)

Try passing a 10MB PHP file that's all code into token_get_all(). It's
pretty easy to hit hard memory limits and/or start crashing PHP when
token_get_all() tokenizes the whole thing into a giant array or set of
objects. Calling gc_mem_caches() once the previous RAM bits aren't needed
anymore helps.

Stream-based token parsing would be better for RAM usage, but I can see how
it might be complex to implement and largely not worth it: such scenarios
will be rare, it would require the ability to maintain lexer state
externally as you mentioned, and it would only be used by this part of the
software.

-- 
Thomas Hruska
CubicleSoft President

I've got great, time saving software that you will find useful.

http://cubiclesoft.com/

And once you find my software useful:

http://cubiclesoft.com/donate/
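
For reference, a minimal sketch of the memory pattern described above. It assumes the tokenizer extension is loaded and uses a small synthetic source string in place of a real 10MB file (the repeated statement and the repeat count are made up for illustration). It tokenizes the whole input at once, records how much memory the token array costs, then drops the array and calls gc_mem_caches(). The exact byte counts will vary by PHP version and build.

```php
<?php
// Synthetic input (an assumption for this sketch, not a real code base).
// Raise the repeat count to approach the multi-megabyte case.
$source = "<?php\n" . str_repeat("\$a = foo(1, 'x') + \$b[2];\n", 5000);

$before = memory_get_usage();

// token_get_all() materializes every token of the input in one array.
$tokens = token_get_all($source);
$withTokens = memory_get_usage();
$count = count($tokens);

// Once the tokens are no longer needed, drop the array and ask the Zend
// memory manager to release the memory it keeps cached for reuse.
unset($tokens);
$reclaimed = gc_mem_caches();
$after = memory_get_usage();

printf("source: %d bytes, tokens: %d\n", strlen($source), $count);
printf("token array cost: %d bytes\n", $withTokens - $before);
printf("after unset: %d bytes above baseline\n", $after - $before);
printf("gc_mem_caches() freed: %d bytes\n", $reclaimed);
```

The token array typically costs many times the size of the source text, which is why a single very large file can run into memory_limit.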