Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:38336 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 76029 invoked from network); 17 Jun 2008 22:58:03 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 17 Jun 2008 22:58:03 -0000 Authentication-Results: pb1.pair.com smtp.mail=nlopess@php.net; spf=unknown; sender-id=unknown Authentication-Results: pb1.pair.com header.from=nlopess@php.net; sender-id=unknown Received-SPF: unknown (pb1.pair.com: domain php.net does not designate 212.55.154.26 as permitted sender) X-PHP-List-Original-Sender: nlopess@php.net X-Host-Fingerprint: 212.55.154.26 relay6.ptmail.sapo.pt Linux 2.4/2.6 Received: from [212.55.154.26] ([212.55.154.26:58718] helo=sapo.pt) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id 6A/F0-03584-97148584 for ; Tue, 17 Jun 2008 18:58:03 -0400 Received: (qmail 23858 invoked from network); 17 Jun 2008 22:57:57 -0000 Received: from unknown (HELO sapo.pt) (10.134.37.165) by relay6 with SMTP; 17 Jun 2008 22:57:57 -0000 Received: (qmail 17352 invoked from network); 17 Jun 2008 22:57:57 -0000 X-AntiVirus: PTMail-AV 0.3-0.92.0 X-Virus-Status: Clean (0.00751 seconds) Received: from unknown (HELO pc07654) (nunoplopes@sapo.pt@[85.240.52.171]) (envelope-sender ) by mta15 (qmail-ldap-1.03) with SMTP for ; 17 Jun 2008 22:57:57 -0000 Message-ID: To: "Michal Dziemianko" , , "Scott MacVicar" References: <7E62CA6E-83F4-4F9C-86FB-75EBE7D489C9@gmail.com> <484D36EB.9080202@macvicar.net> <31fe29920806110114i263a47b4sd66d4849ceb22980@mail.gmail.com> <484F910A.3080208@zend.com> <33FE7D54-B113-4C5F-9315-659F8550397E@gmail.com> In-Reply-To: <33FE7D54-B113-4C5F-9315-659F8550397E@gmail.com> Date: Tue, 17 Jun 2008 23:57:54 +0100 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=response Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Windows Mail 6.0.6001.18000 X-MimeOLE: Produced By Microsoft MimeOLE V6.0.6001.18000 Subject: Re: [PHP-DEV] Algorithm Optimizations - string search From: nlopess@php.net ("Nuno Lopes") Hi, Sorry for taking so long to answer, but I'm trying to catch up last stuff. It's known that usually to optimize things for longer inputs you usually end up making things for short inputs worst. So IMHO, I think you should have the len==1 optimization and then use the KMP algorithm. Your implementation can be further optimized (at least on 32 bits machines), but seems ok for now. I suggest you to produce a final patch, send it here, and then move on to optimize other things (like strtr() that I'm sure wikipedia guys will appreciate). Nuno ----- Original Message ----- > Hello again > I have setup small page where I am publishing all the profiling and > evaluation results. It is here: http://83.168.205.202/~michal/ standard15/ > So far I have put there function usage profile, zend_memnstr analysis, > stripos and strrpos analysis including some charts etc. CVS diffs where > applicable are also posted along with comparison of original and patched > code. > Michal > > On 2008-06-11, at 09:47, Stanislav Malyshev wrote: > >> Hi! >> >>> Here are some statistics: >>> - average haystack length: 624.2 >>> - average needle length: 1.9 ! -> 63% of needles of length 1 >>> - avg length of haystacks shorter than avg: 41.0 -> 85% of all >>> haystacks >>> - avg length of haystacks longer than avg: 5685.11 >> >> I think it would be interesting to see same excluding 1-char needles >> since in this case it should do one-char lookup (btw, if we don't do it >> on C level, it might be a good idea to). >> >> >>> Although strpos implements fix for that, some other functions don't. My >>> idea >>> is than to implement ZEND_MEMNSTR once again in shape: >>> if (needle_len = 1) >>> here just linear sweep >>> else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests >>> needed to choose good value) >>> original implementation (as it is the best one in this case) >>> else >>> BM/KMP (i think BM will be better in this case, as some people >>> suggested) >> >> I'm not sure very big haystacks really worth the trouble - how many of >> them are used? It may be interesting to see medians instead of averages >> for that. But len=1 I think worth having special case. >> -- >> Stanislav Malyshev, Zend Software Architect >> stas@zend.com http://www.zend.com/ >> (408)253-8829 MSN: stas@zend.com