Newsgroups: php.internals
Date: Sun, 1 Mar 2015 19:14:17 +0900
From: yohgaki@ohgaki.net (Yasuo Ohgaki)
To: Joe Watkins
Cc: Chris Wright, Stas Malyshev, Johannes Schlüter, Andrea Faulds, Dmitry Stogov, Philip Hofstetter, PHP Internals
Subject: Re: [PHP-DEV] [RFC] UString

Hi Joe,

On Sun, Mar 1, 2015 at 6:14 PM, Yasuo Ohgaki wrote:

> On Sat, Feb 28, 2015 at 3:48 PM, Joe Watkins wrote:
>
>> This is just a quick note to announce my intention to ready this RFC
>> for voting next week.
>>
>> I know I'm a little late maybe, I was real sick most of last week, so
>> couldn't do anything useful.
>>
>> A couple of us intend to fix outstanding issues on github and those
>> raised here, tidy the RFC and open the vote for 7.
>>
>> I would ask anyone interested to scan through this thread and announce
>> concerns that are not mentioned asap.
>
> I appreciate your proposal!
> Rowan pointed out some important things. I don't understand details as I
> don't read your code yet. I'll try to read and comment in a few days.

I guess you would like to start voting today or tomorrow, so I briefly read your code.

I think your approach is good. I like that UString is always UTF-8 by default, regardless of other settings such as default_charset and internal_encoding.

I see a few missing key APIs that would be critical for multibyte character handling: string length, string width, normalization, string conversions such as Zenkaku to Hankaku, and an encoding (codepage) converter.
However, all of these can be added later, since they are already implemented in ICU.

I think it may be better for UString to always use UTF-8, to make users' lives a little simpler. Your constructor only has a codepage setting, which is used as the UString codepage to support other codepages (encodings). Rather than supporting various encodings, I think the constructor needs an encoding (codepage) conversion feature. The codepage parameter would be better used as a "from encoding (codepage)" parameter, converting any encoding (codepage) to UTF-8. If conversion fails, it should raise an exception. A forgiving API for malformed strings is fine, but only when the user explicitly asks for it.

The constructor may be

    public function __construct([string $string [, string $source_codepage [, string $substitute_char]]]);

$source_codepage is the source string's encoding (codepage), and $string is always converted to UTF-8. If $substitute_char is omitted, an exception is raised for an invalid $string. If $substitute_char is specified (it can be '', the empty string), $string is converted according to $source_codepage and invalid byte sequences in $string are simply removed or replaced.

With this constructor, the string stored in a UString object is always valid UTF-8. Any character encoding (including UTF-16/32 and the 200 encoding names supported by ICU) may be used for the source string.

Since there will be no variable codepage setting for UString objects, the following may be removed:

    public static function getDefaultCodepage();
    public static function setDefaultCodepage(string $codepage);

ICU uses "codepage" to mean "character encoding", but it may be better to say "character encoding", as people are not used to ICU terminology.

This is what I thought. I didn't read your code carefully, so I might be wrong. Please correct me if I'm mistaken.

I suppose there are other people working on simpler Unicode-string-based libraries. I would like to hear opinions from them.

BTW, we really need byte_len(). strlen() is just a confusing API... It's not in the scope of this RFC, though.
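
To make the constructor semantics suggested above concrete, here is a minimal userland sketch. It is only an illustration under stated assumptions, not the RFC's implementation: the class name Utf8String, the method toUtf8() and the parameter names are hypothetical, and it uses mbstring (PHP 7.4 or later) rather than ICU, so the set of supported encodings differs from what UString would offer.

```php
<?php
declare(strict_types=1);

// Hypothetical illustration only: NOT the RFC's UString class.
// Uses mbstring instead of ICU, so supported encoding names differ.
final class Utf8String
{
    /** @var string Always holds valid UTF-8 once constructed. */
    private string $utf8;

    public function __construct(
        string $string = '',
        string $sourceEncoding = 'UTF-8',
        ?string $substituteChar = null
    ) {
        // Strict path: the input is well-formed in its source encoding.
        if (mb_check_encoding($string, $sourceEncoding)) {
            $this->utf8 = mb_convert_encoding($string, 'UTF-8', $sourceEncoding);
            return;
        }

        // No substitute given: malformed input is an error.
        if ($substituteChar === null) {
            throw new InvalidArgumentException(
                "Input is not valid {$sourceEncoding}"
            );
        }

        // Forgiving path: '' removes invalid sequences, anything else
        // replaces them with the first character of $substituteChar.
        if ($substituteChar === '') {
            $substitute = 'none';
        } else {
            $substitute = mb_ord($substituteChar, 'UTF-8');
            if ($substitute === false) {
                throw new InvalidArgumentException('Substitute must be valid UTF-8');
            }
        }

        // mbstring's substitute character is a global setting, so save
        // the current value and restore it after the conversion.
        $previous = mb_substitute_character();
        mb_substitute_character($substitute);
        try {
            $this->utf8 = mb_convert_encoding($string, 'UTF-8', $sourceEncoding);
        } finally {
            mb_substitute_character($previous);
        }
    }

    /** Returns the stored UTF-8 bytes. */
    public function toUtf8(): string
    {
        return $this->utf8;
    }
}
```

With this shape, new Utf8String($bytes, 'ISO-8859-1') converts on construction, new Utf8String($bytes) throws on malformed UTF-8, and new Utf8String($bytes, 'UTF-8', '') silently drops the malformed parts. Because the stored string is UTF-8 in every case, no per-object or default codepage setting is needed.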
Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net