Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:38473 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 93530 invoked from network); 20 Jun 2008 15:50:42 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 20 Jun 2008 15:50:42 -0000 Authentication-Results: pb1.pair.com smtp.mail=michal.dziemianko@gmail.com; spf=unknown; sender-id=unknown Authentication-Results: pb1.pair.com header.from=michal.dziemianko@gmail.com; sender-id=unknown Received-SPF: unknown (pb1.pair.com: domain gmail.com does not designate 82.132.130.151 as permitted sender) X-PHP-List-Original-Sender: michal.dziemianko@gmail.com X-Host-Fingerprint: 82.132.130.151 yoda.london.02.net Linux 2.6 Received: from [82.132.130.151] ([82.132.130.151:55333] helo=mail.o2.co.uk) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id F9/22-16112-0B1DB584 for ; Fri, 20 Jun 2008 11:50:09 -0400 Received: from [192.168.1.64] (78.86.103.249) by mail.o2.co.uk (8.0.013.3) (authenticated as michal_dziemianko@o2.co.uk) id 4851DD950166BC71; Fri, 20 Jun 2008 16:50:03 +0100 In-Reply-To: <485BCBE6.5080201@oracle.com> References: <7E62CA6E-83F4-4F9C-86FB-75EBE7D489C9@gmail.com> <484D36EB.9080202@macvicar.net> <31fe29920806110114i263a47b4sd66d4849ceb22980@mail.gmail.com> <484F910A.3080208@zend.com> <33FE7D54-B113-4C5F-9315-659F8550397E@gmail.com> <11BACFBA-C2C7-4405-9092-679606B41DB1@gmail.com> <485BCBE6.5080201@oracle.com> Mime-Version: 1.0 (Apple Message framework v753.1) Content-Type: multipart/alternative; boundary=Apple-Mail-8--553409825 Message-ID: <61602CD5-CC13-40EC-A8CC-35288768B0A4@gmail.com> Cc: internals@lists.php.net Date: Fri, 20 Jun 2008 16:49:51 +0100 To: Christopher Jones X-Mailer: Apple Mail (2.753.1) Subject: Re: [PHP-DEV] Algorithm Optimizations - string search From: michal.dziemianko@gmail.com (Michal Dziemianko) --Apple-Mail-8--553409825 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Hello I am doing benchmarking all the time - not only of the patches, but also some other code that might be useful later. Here are some values for the STRRPOS patch: http://212.85.117.53/gsoc/ index.php?option=com_content&view=article&id=48:strrpos-patch- proposal&catid=36:patches&Itemid=56 The charts will appear soon, as well as data specifically for strripos (I wrote yesterday that I am going to post most of what I have already till Monday) - I have the data as row output form test script, just need to prepare picture. Michal On 2008-06-20, at 16:25, Christopher Jones wrote: > > Hi Michal, > > The GSOC program lets us share practical advice that we've learnt from > experience. > > Looking at > http://212.85.117.53/gsoc/index.php? > option=com_content&view=article&id=49:strripos-possible- > patches&catid=36:patches&Itemid=56 > How do the algorithms work in practice? I think timing results should > be given. > > Also the profiling results aren't explained well: it took me a while > to work out that your profiling was not time using the first suggested > algorithm, but was occurence of a situation. I know you mentioned > this before, but I'd forgotten. That's exactly where project > documents (like your page) should help the busy reader by being very > explicit. > > My experience is that project specifications need to be self > contained. Also they can often been improved by stating what hasn't > or won't be covered - this leave no room for ambiguity. For example, > if you don't plan to do any benchmarks, say so. > > Keep up the good work, > > Chris > > > Michal Dziemianko wrote: > > Hello, > > Here goes first diff - to keep it simple and avoid confusion I > will post > > them one-by-one after the previous is accepted/rejected. > > Optimization of STRRPOS/STRRIPOS for PHP_5_3. > > > > 2 things are changed: > > * implementation of search loop from theta(NM) to o(N), omega(NM) > - it > > is now the same as original zend_memnstr - using zend_memrchr > instead of > > memchr function. Here is comparison of performance: > > http://212.85.117.53/gsoc/index.php? > option=com_content&view=article&id=48:strrpos-patch- > proposal&catid=36:patches&Itemid=56 > > > > This implementation is faster than original for any reasonable > input (a > > little slower for "aaab" in "aaaaaaaab", but that is really bad > case), > > and I think in this case we should not bother with KMP/BM as I > haven't > > found any call to this functions with haystack longer than 100 > (NOTE: on > > MY server, over ~1h - this can be different for wikipedia > guys;))). In > > case somebody needs KMP - I have it implemented to work > "backwards" thus > > suitable for strrpos. > > > > * second thing is pre-lowering the haystack in strripos in case of > > needle len=1 - why is explained here: > > http://212.85.117.53/gsoc/index.php? > option=com_content&view=article&id=49:strripos-possible- > patches&catid=36:patches&Itemid=56 > > > > > > Here is the diff: > > -------------------------------------------------- > > diff -u -d -r1.445.2.14.2.69.2.22 string.c > > --- ext/standard/string.c 27 May 2008 10:29:33 -0000 > > 1.445.2.14.2.69.2.22 > > +++ ext/standard/string.c 20 Jun 2008 08:08:34 -0000 > > @@ -1876,18 +1876,19 @@ > > > > if (needle_len == 1) { > > /* Single character search can shortcut memcmps */ > > - while (e >= p) { > > - if (*e == *needle) { > > - RETURN_LONG(e - p + (offset > 0 ? > offset > > : 0)); > > - } > > - e--; > > - } > > + if ((e = (char *)zend_memrchr(p, *needle, e - p > + 1)) > > != NULL) { > > + RETURN_LONG(e - p + (offset > 0 ? > offset : 0)); > > + } > > RETURN_FALSE; > > } > > > > while (e >= p) { > > - if (memcmp(e, needle, needle_len) == 0) { > > - RETURN_LONG(e - p + (offset > 0 ? > offset : 0)); > > + if ((e = (char *)zend_memrchr(p, *needle, e - p + > 1)) != > > NULL) { > > + if (!memcmp(needle, e, needle_len)) { > > + RETURN_LONG(e - p + (offset > 0 ? > offset > > : 0)); > > + } > > + } else { > > + RETURN_FALSE; > > } > > e--; > > } > > @@ -1925,6 +1926,11 @@ > > if ((haystack_len == 0) || (needle_len == 0)) { > > RETURN_FALSE; > > } > > + > > + haystack_dup = estrndup(haystack, haystack_len); > > + php_strtolower(haystack_dup, haystack_len); > > + needle_dup = estrndup(needle, needle_len); > > + php_strtolower(needle_dup, needle_len); > > > > if (needle_len == 1) { > > /* Single character search can shortcut memcmps > > @@ -1932,34 +1938,36 @@ > > if (offset >= 0) { > > if (offset > haystack_len) { > > php_error_docref(NULL TSRMLS_CC, > > E_NOTICE, "Offset is greater than the length of haystack string"); > > + efree(needle_dup); > > + efree(haystack_dup); > > RETURN_FALSE; > > } > > - p = haystack + offset; > > - e = haystack + haystack_len - 1; > > + p = haystack_dup + offset; > > + e = haystack_dup + haystack_len - 1; > > } else { > > - p = haystack; > > + p = haystack_dup; > > if (-offset > haystack_len || offset < - > INT_MAX) { > > php_error_docref(NULL TSRMLS_CC, > > E_NOTICE, "Offset is greater than the length of haystack string"); > > + efree(needle_dup); > > + efree(haystack_dup); > > RETURN_FALSE; > > } > > - e = haystack + haystack_len + offset; > > + e = haystack_dup + haystack_len + offset; > > } > > - /* Borrow that ord_needle buffer to avoid repeatedly > > tolower()ing needle */ > > - *ord_needle = tolower(*needle); > > + > > while (e >= p) { > > - if (tolower(*e) == *ord_needle) { > > + if (*e == *needle_dup) { > > + efree(needle_dup); > > + efree(haystack_dup); > > RETURN_LONG(e - p + (offset > 0 ? > offset > > : 0)); > > } > > e--; > > } > > + efree(needle_dup); > > + efree(haystack_dup); > > RETURN_FALSE; > > } > > > > - needle_dup = estrndup(needle, needle_len); > > - php_strtolower(needle_dup, needle_len); > > - haystack_dup = estrndup(haystack, haystack_len); > > - php_strtolower(haystack_dup, haystack_len); > > - > > if (offset >= 0) { > > if (offset > haystack_len) { > > efree(needle_dup); > > @@ -1985,11 +1993,13 @@ > > } > > > > while (e >= p) { > > - if (memcmp(e, needle_dup, needle_len) == 0) { > > - efree(haystack_dup); > > - efree(needle_dup); > > - RETURN_LONG(e - p + (offset > 0 ? > offset : 0)); > > - } > > + if ((e = (char *)zend_memrchr(p, *needle, e - p + > 1)) != > > NULL) { > > + if (!memcmp(needle, e, needle_len)) { > > + efree(haystack_dup); > > + efree(needle_dup); > > + RETURN_LONG(e - p + (offset > 0 ? > offset > > : 0)); > > + } > > + } > > e--; > > } > > ---------------------------------------------- > > > > Tested and PASS. Valgrind does not show any memory leaks, but if > > somebody more competent can confirm it, than I will really > appreciate it. > > Cheers, > > Michal > > > > PS If somebody can take a look at that: > > http://212.85.117.53/gsoc/index.php? > option=com_content&view=article&id=52:small- > buginconsistence&catid=38:other&Itemid=58 > > it would be also great. If it should be the same in both > functions than > > I have patch and test cases for it prepared. > > > > > > -- > Christopher Jones, Oracle > Email: christopher.jones@oracle.com Tel: +1 650 506 8630 > Blog: http://blogs.oracle.com/opal/ Free PHP Book: http:// > tinyurl.com/f8jad --Apple-Mail-8--553409825--