Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:38162 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 23991 invoked from network); 11 Jun 2008 08:47:15 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 11 Jun 2008 08:47:15 -0000 Authentication-Results: pb1.pair.com smtp.mail=stas@zend.com; spf=pass; sender-id=pass Authentication-Results: pb1.pair.com header.from=stas@zend.com; sender-id=pass Received-SPF: pass (pb1.pair.com: domain zend.com designates 212.25.124.162 as permitted sender) X-PHP-List-Original-Sender: stas@zend.com X-Host-Fingerprint: 212.25.124.162 mail.zend.com Windows 2000 SP4, XP SP1 Received: from [212.25.124.162] ([212.25.124.162:20946] helo=mx1.zend.com) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id 2B/3E-26183-1119F484 for ; Wed, 11 Jun 2008 04:47:15 -0400 Received: from [10.1.3.118] ([10.1.3.118]) by mx1.zend.com with Microsoft SMTPSVC(6.0.3790.3959); Wed, 11 Jun 2008 11:47:15 +0300 Message-ID: <484F910A.3080208@zend.com> Date: Wed, 11 Jun 2008 11:47:06 +0300 Organization: Zend Technologies User-Agent: Thunderbird 2.0.0.14 (Windows/20080421) MIME-Version: 1.0 To: Michal Dziemianko CC: internals@lists.php.net References: <7E62CA6E-83F4-4F9C-86FB-75EBE7D489C9@gmail.com> <484D36EB.9080202@macvicar.net> <31fe29920806110114i263a47b4sd66d4849ceb22980@mail.gmail.com> In-Reply-To: <31fe29920806110114i263a47b4sd66d4849ceb22980@mail.gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 11 Jun 2008 08:47:15.0398 (UTC) FILETIME=[BF8CE660:01C8CB9F] Subject: Re: [PHP-DEV] Algorithm Optimizations - string search From: stas@zend.com (Stanislav Malyshev) Hi! > Here are some statistics: > - average haystack length: 624.2 > - average needle length: 1.9 ! -> 63% of needles of length 1 > - avg length of haystacks shorter than avg: 41.0 -> 85% of all haystacks > - avg length of haystacks longer than avg: 5685.11 I think it would be interesting to see same excluding 1-char needles since in this case it should do one-char lookup (btw, if we don't do it on C level, it might be a good idea to). > Although strpos implements fix for that, some other functions don't. My idea > is than to implement ZEND_MEMNSTR once again in shape: > > if (needle_len = 1) > here just linear sweep > else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests > needed to choose good value) > original implementation (as it is the best one in this case) > else > BM/KMP (i think BM will be better in this case, as some people > suggested) I'm not sure very big haystacks really worth the trouble - how many of them are used? It may be interesting to see medians instead of averages for that. But len=1 I think worth having special case. -- Stanislav Malyshev, Zend Software Architect stas@zend.com http://www.zend.com/ (408)253-8829 MSN: stas@zend.com