Newsgroups: php.internals
From: andrei@gravitonic.com (Andrei Zmievski)
To: Nuno Lopes
Cc: "PHPdev"
Date: Tue, 25 Jul 2006 16:28:38 -0700
Subject: Re: [PHP-DEV] upgrading the zlib extension to unicode
Message-ID: <642A77E2-5255-415C-B961-4DAC391C74F6@gravitonic.com>
In-Reply-To: <006501c6b036$2f6f3e60$0100a8c0@pc07653>

On Jul 25, 2006, at 3:03 PM, Nuno Lopes wrote:

> Hello,
>
> So Andrei asked me to upgrade the zlib extension, but I have a few
> questions I would like to discuss with you:
> * when receiving an unicode string, what should we do? compress
> with as-is, prepend a BOM header (and skip it while
> uncompressing)? (now I'm unsure if PHP/ICU uses utf16 in the
> machine endianess or not)

It does use UTF-16 in machine specific endian format. I think you have several approaches you can take when compressing a Unicode string:

1. Compress as-is
2. Convert the string to big endian, for example, and compress
3. Convert to UTF-8 and then compress

The problem with #2 and #3 is decompression. You need to know that it was a Unicode string and do appropriate conversion after decompressing.

> * when uncompressing, check for a BOM header and return a unicode
> string if it is present? return always a binary string?

BOM header is not present in internal UTF-16 strings. It is only present if you convert them to UTF-16BE or UTF-16LE.

-Andrei
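
The three approaches listed above can be sketched roughly as follows. This is an illustration in Python rather than the zlib extension's actual code: the function names and the sample string are hypothetical, and Python's `str` merely stands in for the internal UTF-16 string. It shows why #2 and #3 make decompression harder: the compressed bytes carry no record of which encoding was used, so the caller has to know it.

```python
import sys
import zlib

# Stand-in for the "machine specific endian" internal UTF-16 form (no BOM).
NATIVE_UTF16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def compress_as_is(s: str) -> bytes:
    """Approach 1: compress the native-endian UTF-16 bytes unchanged."""
    return zlib.compress(s.encode(NATIVE_UTF16))


def compress_big_endian(s: str) -> bytes:
    """Approach 2: convert to a fixed byte order (big endian), then compress."""
    return zlib.compress(s.encode("utf-16-be"))


def compress_utf8(s: str) -> bytes:
    """Approach 3: convert to UTF-8, then compress."""
    return zlib.compress(s.encode("utf-8"))


def decompress(blob: bytes, codec: str) -> str:
    """Decompress and convert back; the caller must already know the codec."""
    return zlib.decompress(blob).decode(codec)


if __name__ == "__main__":
    text = "na\u00efve caf\u00e9"  # hypothetical sample string
    for name, blob, codec in [
        ("as-is", compress_as_is(text), NATIVE_UTF16),
        ("big endian", compress_big_endian(text), "utf-16-be"),
        ("UTF-8", compress_utf8(text), "utf-8"),
    ]:
        # Round-trips only because the matching codec is supplied here.
        print(name, len(blob), decompress(blob, codec) == text)
```

Approach 1 is only portable between machines with the same byte order, while #2 and #3 yield the same bytes everywhere but need the conversion step after decompressing.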