Newsgroups: php.internals
Message-ID: <49697F2C.3050900@fischer.name>
Date: Sun, 11 Jan 2009 06:10:04 +0100
To: Thomas Koch
CC: internals@lists.php.net
References: <200901101830.35403.thomas@koch.ro>
In-Reply-To: <200901101830.35403.thomas@koch.ro>
Subject: Re: [PHP-DEV] php daemons, memory
From: markus@fischer.name (Markus Fischer)

Hi,

albeit *not* as a daemon, we've successfully developed a crawler in PHP within our company. It can run for hours without a leak; if I remember correctly, its peak memory consumption is below 64 MB. However, we're crawling only a small number of URLs, just around 10,000.

As Brian mentioned: free your database resources and unset unused variables. We've had one major rewrite which, besides re-architecting the whole thing for plugins/modularity, involved auditing every step to make sure resources are properly freed. Usually a PHP developer doesn't have to pay much attention to this because of the widely used process-fork model (but I guess I don't need to tell you that :).
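A minimal sketch of that advice, not the crawler's actual code: the function name crawlAll(), the example.com URLs, and the php://temp stream standing in for a database result or fetched page are all assumptions made for illustration. It frees the resource and unsets the large buffer at the end of every iteration, and periodically calls gc_collect_cycles(), which exists only in PHP 5.3 and later.

```php
<?php
// Sketch only: crawlAll() and the example.com URLs are hypothetical
// placeholders, not the crawler described in this message.

/**
 * Process every URL in one long-running loop while releasing memory
 * after each item. Returns counters for progress and memory logging.
 */
function crawlAll(array $urls, int $gcEvery = 100): array
{
    $processed = 0;
    $bytes = 0;

    foreach ($urls as $url) {
        // Stand-in for a database result set or a fetched response body.
        $handle = fopen('php://temp', 'w+');
        fwrite($handle, str_repeat($url, 1000));
        rewind($handle);
        $body = stream_get_contents($handle);

        $bytes += strlen($body);
        $processed++;

        // Free the resource and drop the large buffer before the next
        // iteration instead of waiting for the script to end.
        fclose($handle);
        unset($handle, $body);

        // Periodically collect reference cycles (PHP 5.3 and later).
        if ($processed % $gcEvery === 0) {
            gc_collect_cycles();
        }
    }

    return [
        'processed' => $processed,
        'bytes'     => $bytes,
        'peak'      => memory_get_peak_usage(true),
    ];
}

// Placeholder input: 500 made-up URLs.
$urls = [];
for ($n = 1; $n <= 500; $n++) {
    $urls[] = "http://example.com/page/$n";
}

$stats = crawlAll($urls);
printf(
    "processed=%d bytes=%d peak=%.1f MB\n",
    $stats['processed'],
    $stats['bytes'],
    $stats['peak'] / 1048576
);
```

Logging the peak value at intervals during a long run is one way to notice a leak early: if it keeps climbing although every iteration cleans up after itself, the growth is probably coming from somewhere other than your own variables.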
But you'll often get beaten by PHP itself: it has quite a few leaks, and finding and tracking them down costs time and sometimes requires skill at the C level of PHP to properly understand and diagnose things. And if you were (unfortunately) successful in identifying a PHP problem, you have to report a bug, preferably with a patch or workaround attached. For example, we had to fight http://bugs.php.net/bug.php?id=43450 . Tracking down this PHP problem was quite time-consuming and involved multiple developers. Luckily we could work around it, but it was pretty annoying.

We actually planned to release this as open source, donate it to Zend, whatever. The legal side is settled within the company; it's just that no one has had the time for the publishing process, going over things, etc. :/

As a side note: we've hit the current limit of our crawler implementation in PHP itself: we can't do parallel fetching/processing of URLs in an efficient manner. You can get things running quickly in PHP, but doing things with style and a serious architecture hits its limits. We've gone to Java for such cases, which made sense for us anyway as we had to move away from Zend_Search_Lucene: it had performance problems with our index, whereas Lucene/Solr was still mostly bored. It will be interesting to see if http://code.google.com/p/marjory/ can handle this. Oops, off-topic.

HTH,
- Markus