Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:38172 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 72749 invoked from network); 11 Jun 2008 21:44:29 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 11 Jun 2008 21:44:29 -0000 Authentication-Results: pb1.pair.com smtp.mail=michal.dziemianko@gmail.com; spf=pass; sender-id=pass Authentication-Results: pb1.pair.com header.from=michal.dziemianko@gmail.com; sender-id=pass; domainkeys=bad Received-SPF: pass (pb1.pair.com: domain gmail.com designates 66.249.92.173 as permitted sender) DomainKey-Status: bad X-DomainKeys: Ecelerity dk_validate implementing draft-delany-domainkeys-base-01 X-PHP-List-Original-Sender: michal.dziemianko@gmail.com X-Host-Fingerprint: 66.249.92.173 ug-out-1314.google.com Received: from [66.249.92.173] ([66.249.92.173:42300] helo=ug-out-1314.google.com) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id F2/97-26183-C3740584 for ; Wed, 11 Jun 2008 17:44:29 -0400 Received: by ug-out-1314.google.com with SMTP id h3so213070ugf.29 for ; Wed, 11 Jun 2008 14:44:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:in-reply-to:mime-version:content-type:references; bh=WCU2ruC/n82f9CguhMS2CuqjvyKeSLCArdKS8GZYP7M=; b=mJ1eW99dubY8EIr6zNaX/h8rGM+8Hp6knGiD7+H/FAk1q4pLR1n2EY+PiXul1w3ko7 xiugUzAl7GqcCf9fE0utBFSc1aozBAllNPS2MfLucPKnPLh6MsPkX9HU0E0j/Jj1VAyY dYJjdXGo3XfDO7n8jVrtsdCUgeT4o9niWk5PI= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:in-reply-to:mime-version :content-type:references; b=D1FFR+5NvnJS1WlzltaHyH8/MrOEmp4P2fue1i7bOrvHyURxHUWLa3LpgyAVvH4hB+ sHnQ5Q5ig8A5pAWczDm1f6X6Uo8hBe/kJ9+C9Zpsl/tdlchgOKltji/CqMmzhNK+f1SP Htzz/kTHNpetnSDVvwWit0mdxVstlqNq6gvGs= Received: by 10.210.118.7 with SMTP id q7mr311224ebc.99.1213220665637; Wed, 11 Jun 2008 14:44:25 -0700 (PDT) Received: by 10.210.25.6 with HTTP; Wed, 11 Jun 2008 14:44:25 -0700 (PDT) Message-ID: <31fe29920806111444u250d933au36d824f48c78ff5@mail.gmail.com> Date: Wed, 11 Jun 2008 22:44:25 +0100 To: "Stanislav Malyshev" , internals@lists.php.net In-Reply-To: <484F910A.3080208@zend.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_1_4585094.1213220665629" References: <7E62CA6E-83F4-4F9C-86FB-75EBE7D489C9@gmail.com> <484D36EB.9080202@macvicar.net> <31fe29920806110114i263a47b4sd66d4849ceb22980@mail.gmail.com> <484F910A.3080208@zend.com> Subject: Re: [PHP-DEV] Algorithm Optimizations - string search From: michal.dziemianko@gmail.com ("Michal Dziemianko") ------=_Part_1_4585094.1213220665629 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-Disposition: inline Hello again On Wed, Jun 11, 2008 at 9:47 AM, Stanislav Malyshev wrote: > I'm not sure very big haystacks really worth the trouble - how many of them > are used? It may be interesting to see medians instead of averages for that. > But len=1 I think worth having special case. > -- > Here are some charts showing haystack and needle length distribution. Haystack: http://212.85.117.53/haystack.png (shows only lengths up to 2000 -> haystacks longer are only 0.7%) Needle: http://212.85.117.53/needle.png For haystack median is 19, for needle..1. As both haystack and needle are in most cases short the optimization should go in this direction. Now lets forget about the first patch I made (if I did this profiling before I wouldn't have came up with this idea). And consider for now only the patch for better execution in case of needle_len = 1 Here are some results of a comparison (10 000 000 iterations, first row is original zend_memnstr implementation, second with patch) time in seconds. match at 11: 0.72 0.43 match at 773 12.61 12.43 Here is diff for the patch: -------------------------------- --- Zend/zend_operators.h 19 Mar 2008 12:44:13 -0000 1.94.2.4.2.10.2.7 +++ Zend/zend_operators.h 11 Jun 2008 21:35:58 -0000 @@ -220,6 +220,10 @@ char *p = haystack; char ne = needle[needle_len-1]; + if (needle_len == 1){ + return (char *)memchr(p, *needle, (end-p+1)); + } + end -= needle_len; while (p <= end) { -------------------------------- To avoid any discussion about UNICODE etc: this is patch for PHP_5_3 - function memchr is used in original implementation of zend_memnstr (followed by memcmp) I will take a look into HEAD later today. Nathan Nobbe has shown me this: http://marc.info/?t=121254018200001&r=1&w=2 (thx Nathan). I did a small comparison of different implementations of stripos - the original one, and with lowering each caracter separately just before comparing (2 different versions). Here are some results (1st row is original implementation, 2nd with separate lowering), time in seconds, each stripos executed 1000000 times: string length 31, match at 15 0.63 0.53 str len 31, match at 28 0.72 0.93 str len 18 match at 15 0.44 0.50 str len 18 match at 5 0.42 0.32 string len 575 match at 296 9.03 6.07 len 575 match at 570 10.59 11.92 len 1220 match at 400 19.06 10.76 len 1220 match at 1206 22.190000 24.680000 It shows that in case of short haystacks (assuming that there is a match, and it is in first half of the haystack) implementation that does not lower the whole haystack at the beginning is better. Even in case of match closely to the end of string the difference is not that big (2 seconds in last test shown). As said before - the median is 19...so maybe it is worth considering. All comments are highly appreciated, regards Michal ------=_Part_1_4585094.1213220665629--