Newsgroups: php.internals
Message-ID: <00b201c7bf87$1c18b560$0201a8c0@pc1>
Date: Thu, 5 Jul 2007 23:35:39 -0500
Subject: Re: [PHP-DEV] What is the use of "unicode.semantics" in PHP 6?
From: php_lists@realplain.com ("Matt Wilmas")

Hi Richard,

----- Original Message -----
From: "Richard Lynch"
Sent: Thursday, July 05, 2007 10:43 PM

> On Fri, June 29, 2007 1:21 am, Tomas Kuliavas wrote:
> >> If unicode semantics are "on" what exactly is borked in PHP 5?
> >
> > In Unicode mode \[0-7]{1,3} and \x[0-9A-Fa-f]{1,2} refer to unicode
> > code points and not to octal or hexadecimal byte values. Fix is not
> > backwards compatible.
>
> Gak.
>
> You mean this will break:
>
> <?php
> $mask = 0xf0;
> $value = $_POST['foo'] & $mask;
> ?>
>
> because of Unicode?
>
> That's nuts.
>
> That can't be right...

No, that shouldn't break. $mask is an int, and the other operand with &
etc. would also be converted to int, so it should be the same whether
$_POST['foo'] is a binary string or Unicode.

And I don't understand the previous message about \[0-7]{1,3} and
\x[0-9A-Fa-f]{1,2} (inside of strings, that means) referring to Unicode
code points. I think octal and hex escapes work the same in Unicode
mode...

> > Scripts can't match bytes. How they are supposed to check if string is
> > in plain ascii or in 8bit? Do conversion to ASCII and check for errors
> > instead of looking for 8bit byte values? How can scripts replace 8bit
> > bytes with some other strings? ISO-8859-2 decoding table contains 95
> > entries written and evaluated as binary strings. Same thing applies to
> > other iso-8859 and windows-125x character sets. iso-89859-1 and utf-8
> > decoding does not use mapping tables and performs complex calculations
> > with byte values.
> > multibyte character set decoding might actually benefit from
> > unicode_encode(), if Table 325 (http://www.php.net/unicode) provides
> > more information about U_INVALID_SUBSTITUTE and other unicode.
> > settings.
>
> I don't even understand this.
>
> But if I haven't done something new-fangled to make a string be some
> new-fangled Unicode thingie, then it's just plain old ASCII, no?
>
> Or PHP can just assume that anyway...

No, that's basically the issue that this thread is about -- that when
unicode.semantics=On, even though you *haven't done* anything new-fangled
with Unicode, it IS Unicode regardless (unless binary strings are
explicitly used). That's how things may behave differently all of a
sudden.

Did you see my message a couple weeks ago?:
http://marc.info/?l=php-dev&m=118234541809801&w=2

Seems to me it would be great if any new Unicode stuff had to be
explicitly specified, though internally Unicode would always be there
ready to use, regardless of a setting, and old code would continue to
work as before. What do you think? I'd hoped for some replies about it,
since I also have some ideas about possible internals concerns...

[...]
> > PHP6 could introduce new Unicode aware functions, but Unicode
> > implementation choose to modify existing ones. All low level string
> > operations ($string[1]) are Unicode aware by default and not when
> > script actually asks for it. Such implementation is designed for
> > developers, who don't care about Unicode support and want it out of
> > the box without any changes in their Unicode unaware scripts. It is
> > not designed for developers that actually need it and want to have
> > code working in PHP6 and PHP4/5.
>
> But an old script ought to just work...

Again, not necessarily if the Unicode switch is on.

Matt
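
For anyone who wants to try the two points above, here is a rough sketch. It runs on released PHP (7.0 or later, because of the \u{...} escape), not on a PHP 6 build, so it says nothing about what unicode.semantics=On would have done; it only shows the byte-oriented behavior that old scripts rely on. $fromForm is a made-up stand-in for $_POST['foo'].

```php
<?php
// Int & string: the numeric string is converted to int before the AND,
// so the result does not depend on how the string is stored.
$mask = 0xf0;
$fromForm = "255";            // hypothetical stand-in for $_POST['foo']
$value = $fromForm & $mask;   // 255 & 240
var_dump($value);             // int(240)

// String & string is the different case: the AND is applied byte by byte.
$bytewise = "a" & "c";        // 0x61 & 0x63 = 0x61
var_dump($bytewise);          // string(1) "a"

// Octal and hex escapes each produce exactly one byte.
$octal = "\351";              // octal 351 = 0xE9
$hex = "\xE9";
var_dump($octal === $hex);    // bool(true)
var_dump(strlen($hex));       // int(1)

// The PHP 7+ code point escape is separate syntax and yields UTF-8 bytes.
$codePoint = "\u{e9}";
var_dump(strlen($codePoint));           // int(2)
var_dump($codePoint === "\xC3\xA9");    // bool(true)
```

The string & string case is included only to show why it matters that $mask is an int in the quoted snippet.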