Newsgroups: php.internals
Xref: news.php.net php.internals:18738
Message-ID: <43215A91.8050409@zend.com>
Date: Fri, 09 Sep 2005 13:49:05 +0400
From: antony@zend.com (Antony Dovgal)
To: php-dev
CC: Dmitry Stogov, Andrei Zmievski
Subject: unserialize() & unicode issues

Hello all.

I'm currently working on unicode support in serialize()/unserialize() and I'm stuck on some issues. Here they are:

1) What should we do when unserializing serialized unicode strings while unicode_semantics is Off? I presume it's safe to create & return IS_UNICODE in this case?

2) Class names are serialized without a U: or s: prefix, but I can detect a unicode string by its leading "\". It looks kinda tricky, but on the other hand a backslash can't appear there if it's not unicode. Or should I change it to use U:/s: prefixes? (I didn't try it yet, so I can't say how difficult it would be.)
The other problem here is that we can't use unicode class names when unicode_semantics is Off, because in this case class_table stores them as IS_STRING and we won't be able to find a class entry by its unicode name (thanks to Val for noticing this).

3) Currently serialize() produces valid \u0000 sequences, which can be parsed/restored perfectly fine when they are read from a file or returned from serialize(). But specifying them as a constant string won't work, as these sequences get parsed at compile time.

Short example:

IMO the best way here is to change serialize() output to produce something else (for example \pu0000 instead of \u0000); in this case it works just fine.

Comments?

-- 
Wbr,
Antony Dovgal
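
To illustrate issue 3, here is a rough sketch. It is written in Python rather than PHP and is not the actual serializer code. The helper names (escape_units, unescape_units, parse_literal) and the exact escaping rules are assumptions made up for the illustration; parse_literal only stands in for a compiler that decodes \u0000-style sequences inside string literals. The sketch shows that a \u payload is altered when it passes through such a literal, while a \pu payload arrives unchanged.

```python
import re


def escape_units(text: str, marker: str = "\\u") -> str:
    """Hypothetical escaper: control chars, non-ASCII chars and the
    backslash itself are written as marker + 4 hex digits (BMP only)."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("this sketch only handles BMP code points")
        if code < 0x20 or code > 0x7E or ch == "\\":
            out.append(f"{marker}{code:04x}")
        else:
            out.append(ch)
    return "".join(out)


def unescape_units(data: str, marker: str = "\\u") -> str:
    """Inverse of escape_units for the same marker."""
    pattern = re.escape(marker) + r"([0-9a-fA-F]{4})"
    return re.sub(pattern, lambda m: chr(int(m.group(1), 16)), data)


def parse_literal(source_text: str) -> str:
    """Stand-in (assumption) for compile-time literal parsing:
    it decodes \\uXXXX sequences and leaves everything else alone."""
    return unescape_units(source_text, "\\u")


original = "A" + chr(0x0416) + chr(0)

# Payload read from a file or taken straight from the serializer:
# the escape sequences reach the unserializer intact.
payload = escape_units(original)
restored = unescape_units(payload)

# Payload pasted into source code as a string literal: the compiler
# decodes the sequences first, so the unserializer sees different text.
pasted = parse_literal(payload)

# Same experiment with the alternative marker proposed above.
payload_pu = escape_units(original, "\\pu")
pasted_pu = parse_literal(payload_pu)
restored_pu = unescape_units(pasted_pu, "\\pu")

print(payload, pasted != payload)
print(payload_pu, pasted_pu == payload_pu, restored_pu == original)
```

The backslash is escaped as well so that a literal backslash in the input can never be mistaken for the start of an escape sequence; whether the real serializer needs that is not stated in this message.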