Newsgroups: php.internals
Message-ID: <4e89b4260610101351j5480012es5e108e21b122c72e@mail.gmail.com>
Date: Tue, 10 Oct 2006 16:51:12 -0400
From: kingwez@gmail.com ("Wez Furlong")
To: "Sara Golemon"
Cc: ilia@prohost.org, andrei@gravitonic.com, internals@lists.php.net
In-Reply-To: <452A78A1.2020309@php.net>
Subject: Re: PDO/Unicode Migration Strategies

Let's talk about this at the Zend conference. I've only had time to skim
your email, but what I had in mind is more C than A or B on your list.

The PHP 5.2 branch is where it's at for PDO, as the unicode APIs were
changing too wildly to make it feasible for me to keep everything in sync.

Given my lack of free time, I'm still in favour of finding a way to avoid
branching and merging PDO too much while we maintain the PHP 5 branch, as
there are so many PDO extensions that it will become very easy (for me at
least) to forget a vital merge.

--Wez.

On 10/9/06, Sara Golemon wrote:
> PDO Devs, et al.;
>
> It's that time, time to start looking at PDO's plans for the future,
> specifically how it'll integrate with the wild world of unicode. After
> working with the sqlite2 native driver and reading up on some of the
> other RDBMSes, I've come up with a few scenarios of varying merit for PDO
> that I'd like to bounce against y'all and the world at large.
>
> (A) PDO downcodes all inbound unicode data (SQL statements, bound
> params, etc.) to UTF-8, and upconverts return data (results) from UTF-8
> to UTF-16 (UChar type) on return (when UG(unicode) is enabled).
>
> Pros: No changes to the dbh/stmt handler APIs.
> Cons: Changes to assumptions made by many (most?) drivers.
>       Anywhere non-UTF-8 data (e.g. latin1) is expected, the data will
>       have to be re-converted.
>       Doesn't cleanly account for binary strings passed in which are
>       not already UTF-8 encoded, which could easily lead to wtf when in
>       non-unicode semantics mode (normal case for many/most users). More
>       so when the driver is trying to decide if it can use the data it
>       received as-is, or if it has to transcode to get to the right
>       charset.
>
>
> (B) Change all string handling APIs (e.g. do/execute/fetch) to include
> a type field (zend_uchar str_type, zstr str, int str_len) so that
> drivers get unicode as UChar*, and non-unicode as char*.
>
> Pros: Leaves character set handling to the driver, which is best
>       equipped to make decisions about its quirks.
>       Binary (most likely localized) data is recognized as such and
>       can be handled appropriately.
> Cons: Puts more work on the actual driver to handle unicode conversion.
>       Leads to lots of #ifdef macros since drivers live in PECL and
>       must still be compilable on PHP5.
>
>
> (C) Add a UConverter *encoding_conv; element to pdo_dbh and pdo_stmt
> objects, and an INI setting: pdo.default_encoding. When passing data
> to/from a stmt object, the statement object's converter is used if
> available (set during prepare); if not available, the driver's converter
> is used (set by factory); otherwise pdo.default_encoding is used as a
> fallback. Data exchanges with the dbh object are handled similarly,
> though (obviously) skipping the stmt step.
>
> Pros: Keeps character set conversion work out of the driver layer.
>       Reduces the amount of #ifdef work for multiple version support.
>       Recognizes that some drivers (SQLITE) use a single encoding
>       universally, while others allow different tables to use different
>       encodings.
> Cons: Doesn't solve the "do()" problem of encoding to different
>       charsets when inserting to tables of a driver which allows
>       different charsets per table.
>       Doesn't provide an indicator which says "This came from a
>       unicode string and was converted by ICU so is reliably in the
>       correct encoding" versus "This was handed to me by the user as a
>       binary string and may contain anything". Though this is also
>       "fixable" by either changing the handler proto or by burying a
>       state flag in the dbh/stmt objects.
>
>
> Personally I like option C the best as it presents the least amount of
> work for individual drivers, costs the least in terms of version/ifdefs,
> and provides a reasonable degree of flexibility.
>
> As mentioned however, only B provides information to the driver on the
> reliability of the encoding: "Is this *really* utf8? Or am I going to
> find a stray \xA0 in here somewhere?" Of course, we currently have no
> such assurance; the user is simply expected to give the driver
> well-formed data, and if they don't, they're SOL already.
>
> I generally don't like A as it's the most wasteful and really doesn't
> solve the difficult problems.
>
> Any rate, share your thoughts.
>
> -Sara
>
> P.S. - Where is primary PDO development happening? Last I heard PECL
> releases were coming out of the 5.1 branch and that was the place to be.
> Has HEAD been kept in sync?
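
For reference, the converter lookup order described in option (C) above
(statement converter first, then the driver's converter on the dbh, then
pdo.default_encoding) could be sketched in C roughly as follows. This is
only an illustration under stated assumptions: every sketch_* type and
function is a hypothetical stand-in rather than PDO's actual structures,
and sketch_converter takes the place of ICU's UConverter so the fragment
compiles without ICU.

```c
#include <stddef.h>

/* Hypothetical stand-in for ICU's UConverter; it only carries a name. */
typedef struct sketch_converter {
    const char *name;
} sketch_converter;

/* Hypothetical, trimmed-down dbh object: converter set by the driver
 * factory, or NULL if the driver did not provide one. */
typedef struct sketch_dbh {
    sketch_converter *encoding_conv;
} sketch_dbh;

/* Hypothetical, trimmed-down stmt object: converter set during prepare,
 * or NULL if the statement has no encoding of its own. */
typedef struct sketch_stmt {
    sketch_dbh *dbh;
    sketch_converter *encoding_conv;
} sketch_stmt;

/* Assumption: stands for the converter built from pdo.default_encoding. */
static sketch_converter default_conv = { "UTF-8" };

/* dbh-level exchanges: driver converter, else the INI default. */
static sketch_converter *sketch_dbh_converter(const sketch_dbh *dbh)
{
    if (dbh != NULL && dbh->encoding_conv != NULL) {
        return dbh->encoding_conv;
    }
    return &default_conv;
}

/* stmt-level exchanges: statement converter, else fall through to the
 * dbh-level lookup above. */
static sketch_converter *sketch_stmt_converter(const sketch_stmt *stmt)
{
    if (stmt != NULL && stmt->encoding_conv != NULL) {
        return stmt->encoding_conv;
    }
    return sketch_dbh_converter(stmt != NULL ? stmt->dbh : NULL);
}
```

With this shape, a driver that uses one encoding everywhere would set only
the dbh converter, while a driver with per-table charsets could set a
converter on each prepared statement. As noted in the cons for (C), this
lookup alone does not address the do() case, where no statement object
exists.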