Newsgroups: php.internals
To: "internals@lists.php.net"
Date: Tue, 21 Jun 2011 10:38:22 -0400
Thread-Topic: [PHP-DEV] foreach() for strings
References: <4DFF7A12.8060808@sugarcrm.com> <4E00818C.7040201@lsces.co.uk> <4E008EA3.4000403@lsces.co.uk>
In-Reply-To: <4E008EA3.4000403@lsces.co.uk>
Subject: RE: [PHP-DEV] foreach() for strings
From: johncrenshaw@priacta.com (John Crenshaw)

Pierre Joye wrote:
> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
>> Pierre Joye wrote:
>>>>
>>>> It depended on ICU there, and I would be against making a core thing in
>>>> PHP 5.x depend on ICU.
>>>
>>> It can and should be done as part of intl, actually.
>>>
>>> But that's somehow unrelated to the proposal here, as it is about
>>> byte, not characters :)
>>
>> I believe this may be where some of the new niggles may be coming from? With
>> browsers returning unicode, it may be that some of the 'extra' characters
>> are being returned as multibyte rather than as single bytes? Such as the
>> problem reported on the general list currently. How do we ensure that we are
>> dealing with single byte character strings nowadays?
>
> As it has been stated numerous times in this thread and other, we do
> not do anything with multi bytes systems, unicode, etc. mbstring and
> intl do, but php's string as of now is all about bytes, array of bytes
> if I may describe them this way.
>
> And we can't change this behavior.

This mindset is fundamentally broken. You can call it a byte array all you want, but the truth is that 99.999% of the time, when a developer is using a string they need it for characters, not for bytes, and characters are not always single bytes. Even English users tend to submit non-ASCII Unicode characters at an alarming rate. If you're using a WYSIWYG editor, Chrome will submit non-breaking spaces as the actual UTF-8 encoded character, not as an HTML-encoded entity. Whether developers like it, or even know it, supporting an extended universal character set is not really optional.
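To make the byte/character mismatch concrete, here is a minimal sketch (not from the original post) using the non-breaking space case described above. It assumes the input is UTF-8 and uses `preg_split` with the `u` modifier as one way to get a per-code-point view without relying on mbstring; the variable names are illustrative only.

```php
<?php
// "a", then U+00A0 NO-BREAK SPACE encoded as UTF-8 (0xC2 0xA0), then "b".
$input = "a\xC2\xA0b";

// Byte-wise view: the units a byte-based iteration over the string would visit.
$bytes = str_split($input);

// Code-point view: the /u modifier makes PCRE treat the subject as UTF-8,
// so the empty pattern splits between code points rather than between bytes.
$chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), " bytes, ", count($chars), " characters\n";

// The non-breaking space shows up as two separate bytes (0xC2 and 0xA0),
// neither of which is a valid UTF-8 sequence on its own.
foreach ($bytes as $i => $byte) {
    printf("byte %d: 0x%02X\n", $i, ord($byte));
}
```

With this input the byte view has four elements and the code-point view has three, so any per-byte loop that treats each element as a character mishandles the non-breaking space.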
PHP makes this bad enough with the whole collection of bytewise string functions, including many with no appropriate multibyte-aware replacement, but at least this can be avoided, quickly audited, and in the future can even be fixed in any number of ways with only a nominal BC impact. Hard-coding this single-byte idiocy into a language construct (foreach), though, would be an incredibly awful idea. This would create a trap for new, naive PHP developers, and create a character set problem that the language could NEVER recover from without a massive BC break.

This proposal is really about adding a feature which, whenever it is used, is almost guaranteed to be an error. It probably won't look like an error to the developer during simple testing, but it will almost certainly show up as an error in production. Is it really worth all that for a bit of syntactic sugar that the developer will have to strip out anyway to fix their bug?

If string iteration needs to be addressed in the core (and IMO it doesn't, because it can be handled at the script level, but if it does), why not use iterator classes? This gives the same functionality and prevents the language from encouraging hidden bugs.

John Crenshaw
Priacta, Inc.
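As an illustration of the iterator-class suggestion, the following is a rough script-level sketch, not an existing PHP class and not something proposed verbatim in this thread. The class name `Utf8CharacterIterator` is hypothetical, the input is assumed to be UTF-8, and "character" here means a Unicode code point (combining sequences are not grouped). It uses return type declarations, so it assumes PHP 7 or later.

```php
<?php
// Hypothetical class name; not part of PHP or of the proposal under discussion.
final class Utf8CharacterIterator implements IteratorAggregate
{
    private $string;

    public function __construct(string $string)
    {
        $this->string = $string;
    }

    // Yields one UTF-8 code point per iteration instead of one byte.
    public function getIterator(): Traversable
    {
        // With the /u modifier, preg_split returns false on invalid UTF-8.
        $chars = preg_split('//u', $this->string, -1, PREG_SPLIT_NO_EMPTY);
        if ($chars === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        return new ArrayIterator($chars);
    }
}

// Usage: "naïve" with the "ï" written as its two UTF-8 bytes (0xC3 0xAF).
$seen = [];
foreach (new Utf8CharacterIterator("na\xC3\xAFve") as $char) {
    $seen[] = $char;
}
echo count($seen), " characters\n";
```

Making the encoding explicit in the class, rather than implicit in `foreach`, means the choice of iteration unit is visible at the call site and can be swapped (for example, for a byte iterator) without a language-level BC break.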