Newsgroups: php.internals
From: john@coggeshall.org (John Coggeshall)
To: PHP Internals
Subject: New Tidy Extension for PHP5
Date: 30 Jul 2003 07:26:02 -0400
Message-ID: <1059564364.1470.18.camel@coogle.localdomain>

Hey all:

I've written a new extension for PHP (ZE2 only) based on the famed HTML Tidy library (http://tidy.sf.net/). This extension provides more than just an incredibly easy way to clean and repair HTML documents: it also includes an API for traversing an arbitrary HTML document using the ZE2 OO support.

I've put the extension, basic PHPDocs, working examples, and tests on my web site:

http://www.coggeshall.org/php/php-tidy-0-5b.tar.gz

Although I haven't written true PHPDocs yet, README_TIDY in the tidy/ directory outlines the API, including a description of the OO methods and properties available for accessing the parsed HTML document tree.

I am interested in hearing from the internals@ community about my extension and finding out what everyone thinks of it. There are memory leaks which still need to be tracked down (which I believe are caused either by ZE2 itself or by something I am missing in my OO implementation), and I know there are probably bugs. I'd welcome suggestions, patches, and even just "it broke doing this".

I plan on maintaining this extension on my web site and perhaps in PECL (unless of course this is something worthy of the standard PHP5 distro), so if nothing else you can always find it there.

Regards,

John

PS -- Here is a paste of one of the examples for those who are curious how the OO stuff works (it pulls all the links out of a document):

    <?php
    /* ... tag in */
    $html = tidy_get_html($tidy);

    /* Traverse the document tree */
    print_r(get_links($html));

    function get_links($node) {
        $urls = array();

        /* Check to see if we are on an <a> tag or not */
        if($node->id == TIDY_TAG_A) {
            /* If we are, find the HREF attribute */
            $attrib = $node->get_attr_type(TIDY_ATTR_HREF);
            if($attrib) {
                /* Add the value of the HREF attrib to $urls */
                $urls[] = $attrib->value;
            }
        }

        /* Are there any children? */
        if($node->has_children()) {
            /* Traverse down each child recursively */
            foreach($node->children as $child) {
                /* Append the results from recursion to $urls */
                foreach(get_links($child) as $url) {
                    $urls[] = $url;
                }
            }
        }

        return $urls;
    }
    ?>

-- 
-~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~-
John Coggeshall
john at coggeshall dot org
http://www.coggeshall.org/
-~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~--~=~-