Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:38434 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 7739 invoked from network); 19 Jun 2008 20:25:01 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 19 Jun 2008 20:25:01 -0000 Authentication-Results: pb1.pair.com header.from=michal.dziemianko@gmail.com; sender-id=unknown Authentication-Results: pb1.pair.com smtp.mail=michal.dziemianko@gmail.com; spf=unknown; sender-id=unknown Received-SPF: unknown (pb1.pair.com: domain gmail.com does not designate 82.132.130.150 as permitted sender) X-PHP-List-Original-Sender: michal.dziemianko@gmail.com X-Host-Fingerprint: 82.132.130.150 vader.london.02.net Linux 2.6 Received: from [82.132.130.150] ([82.132.130.150:40927] helo=mail.o2.co.uk) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id 25/2A-57468-B90CA584 for ; Thu, 19 Jun 2008 16:25:00 -0400 Received: from [192.168.1.64] (78.86.103.249) by mail.o2.co.uk (8.0.013.3) (authenticated as michal_dziemianko@o2.co.uk) id 4851DD95014051A3 for internals@lists.php.net; Thu, 19 Jun 2008 21:24:57 +0100 In-Reply-To: References: <7E62CA6E-83F4-4F9C-86FB-75EBE7D489C9@gmail.com> <484D36EB.9080202@macvicar.net> <31fe29920806110114i263a47b4sd66d4849ceb22980@mail.gmail.com> <484F910A.3080208@zend.com> <33FE7D54-B113-4C5F-9315-659F8550397E@gmail.com> Mime-Version: 1.0 (Apple Message framework v753.1) X-Priority: 3 Content-Type: multipart/alternative; boundary=Apple-Mail-2--623312344 Message-ID: Date: Thu, 19 Jun 2008 21:24:48 +0100 To: internals@lists.php.net X-Mailer: Apple Mail (2.753.1) Subject: Re: [PHP-DEV] Algorithm Optimizations - string search From: michal.dziemianko@gmail.com (Michal Dziemianko) --Apple-Mail-2--623312344 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Hi, I am also sorry for delay - got ill recently and spend a day in bed after night at emergency. I am working on other things now, and hope to post some patches soon. I will create patch for zend_memnstr as you suggest and post it here probably tomorrow. I have some ideas/ implementations/data already prepared for other functions (not only string related) but haven't got time to publish it neither here, nor on the page (http://212.85.117.53/gsoc/), but will try to do it till Monday. Michal On 2008-06-17, at 23:57, Nuno Lopes wrote: > Hi, > > Sorry for taking so long to answer, but I'm trying to catch up last > stuff. > It's known that usually to optimize things for longer inputs you > usually end up making things for short inputs worst. So IMHO, I > think you should have the len==1 optimization and then use the KMP > algorithm. Your implementation can be further optimized (at least > on 32 bits machines), but seems ok for now. > I suggest you to produce a final patch, send it here, and then move > on to optimize other things (like strtr() that I'm sure wikipedia > guys will appreciate). > > Nuno > > > ----- Original Message ----- >> Hello again >> I have setup small page where I am publishing all the profiling >> and evaluation results. It is here: http://83.168.205.202/~michal/ >> standard15/ >> So far I have put there function usage profile, zend_memnstr >> analysis, stripos and strrpos analysis including some charts etc. >> CVS diffs where applicable are also posted along with comparison >> of original and patched code. >> Michal >> >> On 2008-06-11, at 09:47, Stanislav Malyshev wrote: >> >>> Hi! >>> >>>> Here are some statistics: >>>> - average haystack length: 624.2 >>>> - average needle length: 1.9 ! -> 63% of needles of length 1 >>>> - avg length of haystacks shorter than avg: 41.0 -> 85% of all >>>> haystacks >>>> - avg length of haystacks longer than avg: 5685.11 >>> >>> I think it would be interesting to see same excluding 1-char >>> needles since in this case it should do one-char lookup (btw, if >>> we don't do it on C level, it might be a good idea to). >>> >>> >>>> Although strpos implements fix for that, some other functions >>>> don't. My idea >>>> is than to implement ZEND_MEMNSTR once again in shape: >>>> if (needle_len = 1) >>>> here just linear sweep >>>> else if haystack_len < 5000 (5000 is arbitrary - maybe some >>>> more tests >>>> needed to choose good value) >>>> original implementation (as it is the best one in this case) >>>> else >>>> BM/KMP (i think BM will be better in this case, as some people >>>> suggested) >>> >>> I'm not sure very big haystacks really worth the trouble - how >>> many of them are used? It may be interesting to see medians >>> instead of averages for that. But len=1 I think worth having >>> special case. >>> -- >>> Stanislav Malyshev, Zend Software Architect >>> stas@zend.com http://www.zend.com/ >>> (408)253-8829 MSN: stas@zend.com > --Apple-Mail-2--623312344--