Newsgroups: php.internals
Date: Fri, 1 Jun 2012 23:57:56 +0800
To: PHP Internals
Content-Type: text/plain; charset=ISO-8859-1
Subject: domdocument loadhtml and encoding
From: datibbaW@hotmail.com (Tjerk Meesters)

Gentlemen,

Regarding this bug report: https://bugs.php.net/bug.php?id=49705

As more developers move away from regular expressions and start using DOMDocument to parse HTML, I've noticed that quite a few stumble over encoding "issues". They're not bugs, because it's documented (I think) that the encoding is handled as expected if the document is loaded using ::loadHTMLFile() or if it contains a "content-type" meta tag that specifies the character encoding.

So far I've suggested a hack that involves adding the meta tag in front of the string that contains the HTML. As horrible as it seems, it does the job!

That said, I'm hoping to get enough internals support to add a parameter to ::loadHTML() that sets or overrides the default character set used when processing the document; when given, any tags pertaining to character set encoding should be ignored (AFAIK that's also the browser's behavior).

Btw, there's another patch that also introduces a new parameter to ::parseHTML(), which has gone into the 5.4 branch (https://bugs.php.net/bug.php?id=54037), so it looks like this would be the second (optional) parameter then.

Thoughts?

--
Tjerk
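
For reference, the meta-tag hack mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the helper name loadHtmlWithCharset() and its $charset argument are hypothetical and not part of DOMDocument, and the sample string is made up for the example.

```php
<?php
// Hypothetical helper (not part of the DOM extension): prepend a
// "content-type" meta tag so the HTML parser decodes the markup using
// the given character set instead of its default.
function loadHtmlWithCharset($html, $charset = 'UTF-8')
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset='
          . $charset . '">';

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them;
    // real-world markup is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Sample input (an assumption for this sketch): "café" encoded as UTF-8.
$html = "<p>caf\xC3\xA9</p>";

$doc  = loadHtmlWithCharset($html);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

echo $text, "\n";
```

Note that the injected meta tag becomes part of the parsed document, so code that serializes the result with saveHTML() may want to remove it again.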