Upgrading the zlib extension to Unicode

19 years ago by Nuno Lopes

Hello,

So Andrei asked me to upgrade the zlib extension, but I have a few questions
I would like to discuss with you:

When receiving a Unicode string, what should we do? Compress it as-is, or
prepend a BOM header (and skip it while uncompressing)? (Right now I'm unsure
whether PHP/ICU uses UTF-16 in the machine endianness or not.)

When uncompressing, should we check for a BOM header and return a Unicode
string if it is present, or always return a binary string?

I also have another question, unrelated to the zlib extension: what is a
binary string in PHP 6? I think there were some changes in that area (and
there is no IS_BINARY anymore), and I don't really know the difference
between a binary string and the old string (aka runtime_encode'd string).

I hope my questions are clear and make sense,
Nuno

19 years ago by Ilia Alshanetsky

I think whenever we are storing data in an external source, such as a
compressed file, shared memory, and so on, we need to treat the data
as binary.

Ilia Alshanetsky

19 years ago by Michael Wallner

Nuno Lopes wrote:

> Hello,
>
> So Andrei asked me to upgrade the zlib extension, but I have a few
> questions I would like to discuss with you:

I'd like to collaborate on this. Besides reimplementing the output
handler to use the new API, I planned to upgrade it to something
similar to http_encoding_api.

> When receiving a Unicode string, what should we do? Compress it
> as-is, or prepend a BOM header (and skip it while uncompressing)?
> (Right now I'm unsure whether PHP/ICU uses UTF-16 in the machine
> endianness or not.)

I think it should require a binary string.

> When uncompressing, should we check for a BOM header and return a
> Unicode string if it is present, or always return a binary string?

That would make it inconsistent when decoding data from a source other
than PHP, so I'd say, as before, a binary string.

> I also have another question, unrelated to the zlib extension: what
> is a binary string in PHP 6? I think there were some changes in that
> area (and there is no IS_BINARY anymore), and I don't really know the
> difference between a binary string and the old string (aka
> runtime_encode'd string).

IS_STRING is practically a binary string, AFAICT.

Regards,

Michael

19 years ago by Nuno Lopes

Michael Wallner wrote:

> Nuno Lopes wrote:
>
>> Hello,
>>
>> So Andrei asked me to upgrade the zlib extension, but I have a few
>> questions I would like to discuss with you:
>
> I'd like to collaborate on this. Besides reimplementing the output
> handler to use the new API, I planned to upgrade it to something
> similar to http_encoding_api.

OK, I think I've completed the upgrade (it now requires a binary string),
but I left the ob_gzhandler() function for you ;)

Nuno

19 years ago by Andrei Zmievski

So you decided to make the user pass in a binary string explicitly. I
suppose that's one approach, since it makes them think about what
format the binary string should be in before it's compressed.

-Andrei

19 years ago by Nuno Lopes

Yep. After thinking about it, I think it is the best way. If the user
wants to save the Unicode data directly (and save a BOM, etc.), there are
functions to handle that.
So, I've just changed the 's' parameters to 'S' (binary only). I left the
ob_start handler for Mike, since he said he wanted to help with that. The
filter code had already been upgraded (probably by Sara).
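
The binary-only rule can be illustrated outside PHP. The sketch below uses Python's zlib module, which likewise accepts only bytes and refuses text strings, so the caller has to choose an encoding first. It is not the PHP 6 code from this thread; the helper names compress_text and rejects_text and the UTF-8 default are assumptions made for the example.

```python
import zlib


def compress_text(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text explicitly, then compress the resulting bytes.

    Hypothetical helper: the caller, not the compressor, picks the encoding.
    """
    return zlib.compress(text.encode(encoding))


def rejects_text(value) -> bool:
    """Return True if zlib refuses the value because it is not binary."""
    try:
        zlib.compress(value)
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    packed = compress_text("héllo wörld")
    # Decompression hands back bytes; decoding is again the caller's job.
    print(zlib.decompress(packed).decode("utf-8"))
    print(rejects_text("héllo wörld"))
```

Keeping the conversion on the caller's side means the compressed payload has one well-defined byte layout, whatever the string type was in memory.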

Nuno

19 years ago by Michael Wallner

Michael Wallner wrote:

> I planned to upgrade it to something similar
> to http_encoding_api.

So, are there any objections to that?

Regards,

Michael

19 years ago by Andrei Zmievski

Sorry, refresh my memory, what are we talking about here?

-A

19 years ago by Michael Wallner

Andrei Zmievski wrote:

> Sorry, refresh my memory, what are we talking about here?

I was planning to port http://cvs.php.net/pecl/http/http_encoding_api.c?view=markup
to ext/zlib. It would provide a concise zlib API for PHP and use a more
performant iterative inflate approach instead of the current retry approach.
Additionally, if we required a minimum zlib version of 1.2, we'd benefit
from zlib's internal GZIP capabilities.
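
For readers unfamiliar with the two ideas, here is a minimal sketch in Python rather than the C of http_encoding_api.c. It inflates in bounded steps, feeding leftover input back in, instead of guessing an output size and retrying with a bigger buffer. It also asks zlib itself for the gzip wrapper by adding 16 to the window bits, which appears to be the zlib 1.2 capability mentioned above. The function names and the chunk size are assumptions for illustration only.

```python
import zlib

# Adding 16 to the window bits asks zlib to read/write the gzip wrapper itself.
GZIP_WBITS = zlib.MAX_WBITS | 16


def gzip_encode(data: bytes, level: int = 6) -> bytes:
    """Produce a gzip stream using zlib's own gzip support."""
    comp = zlib.compressobj(level, zlib.DEFLATED, GZIP_WBITS)
    return comp.compress(data) + comp.flush()


def inflate_iteratively(data: bytes, chunk_size: int = 4096) -> bytes:
    """Inflate a gzip stream step by step, at most chunk_size bytes per call."""
    decomp = zlib.decompressobj(GZIP_WBITS)
    pieces = []
    pending = data
    while pending:
        # Output is capped at chunk_size; input not yet consumed is kept
        # in unconsumed_tail and fed back in on the next iteration.
        pieces.append(decomp.decompress(pending, chunk_size))
        pending = decomp.unconsumed_tail
    pieces.append(decomp.flush())
    return b"".join(pieces)


if __name__ == "__main__":
    payload = b"zlib " * 10000
    stream = gzip_encode(payload)
    print(len(stream), inflate_iteratively(stream, 512) == payload)
```

The loop never needs to know the uncompressed size in advance, which is the point of the iterative approach.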

Regards,

Michael

19 years ago by Andrei Zmievski

Okay, so this doesn't have much to do with Unicode stuff. I think if
you port it, calling it something other than http_encoding would be a
good idea, to avoid confusion with other "encoding" settings.

-Andrei

19 years ago by Michael Wallner

Andrei Zmievski wrote:

> Okay, so this doesn't have much to do with Unicode stuff. I think if you
> port it, calling it something other than http_encoding would be a good
> idea, to avoid confusion with other "encoding" settings.

Yeah, of course I thought of php_zlib_*.

Thanks,

Michael

19 years ago by Andrei Zmievski

> Hello,
>
> So Andrei asked me to upgrade the zlib extension, but I have a few
> questions I would like to discuss with you:
>
> When receiving a Unicode string, what should we do? Compress it
> as-is, or prepend a BOM header (and skip it while uncompressing)?
> (Right now I'm unsure whether PHP/ICU uses UTF-16 in the machine
> endianness or not.)

It does use UTF-16 in machine-specific endian format. I think you
have several approaches you can take when compressing a Unicode string:

1. Compress as-is
2. Convert the string to big endian, for example, and compress
3. Convert to UTF-8 and then compress

The problem with #2 and #3 is decompression. You need to know that it
was a Unicode string and do appropriate conversion after decompressing.
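
The three options and the decompression problem can be sketched as follows. This is Python standing in for the PHP/ICU internals, so it is only an analogy: NATIVE_UTF16 is an assumed stand-in for the internal machine-endian UTF-16 representation, and the pack and unpack helpers are hypothetical names.

```python
import sys
import zlib

# Stand-in for "internal UTF-16 in machine endianness": no BOM is written.
NATIVE_UTF16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

# The three options from the list above, as label -> encoding.
APPROACHES = {
    "as-is": NATIVE_UTF16,
    "big-endian": "utf-16-be",
    "utf-8": "utf-8",
}


def pack(text: str, approach: str) -> bytes:
    """Convert text according to the chosen approach, then compress."""
    return zlib.compress(text.encode(APPROACHES[approach]))


def unpack(blob: bytes, approach: str) -> str:
    """Decompress, then undo the conversion.

    The approach must be known out of band: nothing in the compressed
    bytes records which conversion was applied.
    """
    return zlib.decompress(blob).decode(APPROACHES[approach])


if __name__ == "__main__":
    text = "naïve"
    for approach in APPROACHES:
        blob = pack(text, approach)
        # Same text, different raw bytes after decompression.
        print(approach, zlib.decompress(blob), unpack(blob, approach) == text)
```

Each approach round-trips only when the reader applies the matching conversion, which is why requiring a binary string and leaving the conversion to the caller sidesteps the question.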

> When uncompressing, should we check for a BOM header and return a
> Unicode string if it is present, or always return a binary string?

BOM header is not present in internal UTF-16 strings. It is only
present if you convert them to UTF-16BE or UTF-16LE.

-Andrei