Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:24976
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
In-Reply-To: <006501c6b036$2f6f3e60$0100a8c0@pc07653>
References: <006501c6b036$2f6f3e60$0100a8c0@pc07653>
Mime-Version: 1.0 (Apple Message framework v750)
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Message-ID: <642A77E2-5255-415C-B961-4DAC391C74F6@gravitonic.com>
Cc: "PHPdev" <internals@lists.php.net>
Content-Transfer-Encoding: 7bit
Date: Tue, 25 Jul 2006 16:28:38 -0700
To: Nuno Lopes <nlopess@php.net>
Subject: Re: [PHP-DEV] upgrading the zlib extension to unicode
From: andrei@gravitonic.com (Andrei Zmievski)

On Jul 25, 2006, at 3:03 PM, Nuno Lopes wrote:

> Hello,
>
> So Andrei asked me to upgrade the zlib extension, but I have a few  
> questions I would like to discuss with you:
> * when receiving a Unicode string, what should we do? compress
> it as-is, prepend a BOM header (and skip it while
> uncompressing)?  (now I'm unsure if PHP/ICU uses UTF-16 in the
> machine endianness or not)

It does use UTF-16 in the machine's native byte order. I think there
are several approaches you can take when compressing a Unicode string:

1. Compress as-is
2. Convert the string to big endian, for example, and compress
3. Convert to UTF-8 and then compress

The problem with #2 and #3 is decompression. The compressed data does
not record that the input was a Unicode string, so you need to know
that some other way and do the appropriate conversion after
decompressing.
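
One way to carry that knowledge along with the data is to prefix the compressed payload with a small marker. The following is only a sketch of approach #3 in Python rather than in the extension's C code; the one-byte tags and both function names are hypothetical and are not part of zlib or PHP.

```python
import zlib

# Hypothetical one-byte tags (an assumption for this sketch, not a zlib feature).
TAG_BINARY = b"\x00"
TAG_UNICODE = b"\x01"


def compress_value(value):
    """Compress bytes as-is; convert text to UTF-8 first and record that fact."""
    if isinstance(value, str):
        return TAG_UNICODE + zlib.compress(value.encode("utf-8"))
    return TAG_BINARY + zlib.compress(value)


def decompress_value(blob):
    """Undo compress_value, decoding back to text only when the tag says so."""
    tag = blob[:1]
    payload = zlib.decompress(blob[1:])
    if tag == TAG_UNICODE:
        return payload.decode("utf-8")
    if tag == TAG_BINARY:
        return payload
    raise ValueError("unknown tag byte")
```

The cost of this design is that the output is no longer a plain zlib stream, so other tools cannot read it without stripping the tag first.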

> * when uncompressing, check for a BOM header and return a Unicode
> string if it is present? Or always return a binary string?

A BOM is not present in internal UTF-16 strings. A converter
typically emits one only when you serialize to the generic "UTF-16"
encoding; the explicit UTF-16BE and UTF-16LE forms are written
without it.
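
A quick way to see which conversions actually emit a BOM is to inspect the bytes. The sketch below uses Python's built-in codecs as a stand-in (an assumption; PHP/ICU converters are not shown here and may differ in detail), and `has_bom` is a hypothetical helper.

```python
import codecs


def has_bom(data):
    """Return True if the byte string starts with a UTF-16 byte order mark."""
    return data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)


text = "zlib"
generic = text.encode("utf-16")        # generic form, native byte order
big_endian = text.encode("utf-16-be")  # explicit big endian
little_endian = text.encode("utf-16-le")  # explicit little endian

print(has_bom(generic), has_bom(big_endian), has_bom(little_endian))
```

In Python the generic codec output is two bytes longer than the explicit forms, which is the BOM.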

-Andrei