PDO/Unicode Migration Strategies

19 years ago by Sara Golemon

PDO Devs, et al.;

It's that time, time to start looking at PDO's plans for the future,
specifically how it'll integrate with the wild world of unicode. After
working with the sqlite2 native driver and reading up on some of the
other RDBMSes, I've come up with a few scenarios of varying merit for PDO
that I'd like to bounce against y'all and the world at large.

(A) PDO downcodes all inbound unicode data (SQL statements, bound
params, etc.) to UTF-8, and upconverts return data (results) from UTF-8
to UTF-16 (UChar type) when UG(unicode) is enabled.

Pros:
  - No changes to the dbh/stmt handler APIs.
Cons:
  - Changes to assumptions made by many (most?) drivers.
  - Anywhere non-UTF-8 data (e.g. latin1) is expected, the data will
    have to be re-converted.
  - Doesn't cleanly account for binary strings passed in that are not
    already UTF-8 encoded, which could easily lead to WTF moments in
    non-unicode semantics mode (the normal case for many/most users),
    all the more so when the driver is trying to decide whether it can
    use the data it received as-is or has to transcode to get to the
    right charset.
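
For illustration, the downcoding step in (A) might look roughly like the
sketch below. It is hand-rolled only to stay self-contained; a real
implementation would presumably lean on ICU (u_strToUTF8, for instance).
The name sketch_utf16_to_utf8 is hypothetical, not an existing PDO or
engine function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts UTF-16 code units to UTF-8.
 * Returns the number of bytes written, or -1 on an unpaired surrogate
 * or when the output buffer is too small. */
static long sketch_utf16_to_utf8(const uint16_t *in, size_t in_len,
                                 unsigned char *out, size_t out_cap)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < in_len; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {        /* high surrogate */
            if (i + 1 >= in_len) return -1;
            uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i++;                                   /* consumed the pair */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) { /* lone low surrogate */
            return -1;
        }

        /* Number of UTF-8 bytes this code point needs. */
        size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out_cap - o < n) return -1;

        switch (n) {
        case 1:
            out[o++] = (unsigned char)cp;
            break;
        case 2:
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        case 3:
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        default:
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        }
    }
    return (long)o;
}
```

Feeding it U+00E9 yields the two bytes C3 A9. A driver talking to a
latin1 table would then have to turn those back into the single byte
E9, which is the re-conversion cost listed in the cons.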

(B) Change all string handling APIs (e.g. do/execute/fetch) to include
a type field (zend_uchar str_type, zstr str, int str_len) so that
drivers get unicode as UChar* and non-unicode as char*.

Pros:
  - Leaves character set handling to the driver, which is best
    equipped to make decisions about its quirks.
  - Binary (most likely localized) data is recognized as such and
    can be handled appropriately.
Cons:
  - Puts more work on the actual driver to handle unicode conversion.
  - Leads to lots of #ifdef macrory, since drivers live in PECL and
    must still be compilable on PHP5.
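
To make (B) concrete, here is a rough sketch of a handler receiving the
proposed (str_type, str, str_len) triple. Everything prefixed sketch_ or
SKETCH_ is a stand-in invented for this example so that it compiles
without the engine or ICU headers; the real types would be zend_uchar,
zstr and UChar as named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the engine's string type tags. */
enum { SKETCH_IS_STRING = 1, SKETCH_IS_UNICODE = 2 };

/* Stand-in for ICU's UChar, a 16-bit UTF-16 code unit. */
typedef uint16_t sketch_uchar;

/* Stand-in for zstr: one pointer, viewed as bytes or as code units. */
typedef union {
    const char         *s;
    const sketch_uchar *u;
} sketch_zstr;

/* Hypothetical driver "do" handler. It returns the payload size in
 * bytes, purely to show the driver branching on str_type. */
static long sketch_driver_do(unsigned char str_type, sketch_zstr str,
                             int str_len)
{
    assert(str_len >= 0);
    switch (str_type) {
    case SKETCH_IS_UNICODE:
        /* Known-good UTF-16: the driver can transcode to whatever
         * charset the target table wants. */
        if (str.u == NULL) return -1;
        return (long)str_len * (long)sizeof(sketch_uchar);
    case SKETCH_IS_STRING:
        /* Binary string of unknown encoding: pass through untouched. */
        if (str.s == NULL) return -1;
        return (long)str_len;
    default:
        return -1;
    }
}
```

The point of the tag is that the driver, not PDO, decides what to do
with each kind of string, at the cost of every driver growing a switch
like this one.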

(C) Add a UConverter *encoding_conv; element to pdo_dbh and pdo_stmt
objects, and an INI setting: pdo.default_encoding. When passing data
to/from a stmt object, the statement object's converter is used if
available (set during prepare); if not, the driver's converter is used
(set by the factory); otherwise pdo.default_encoding is used as a
fallback. Data exchanges with the dbh object are handled similarly,
though (obviously) skipping the stmt step.

Pros:
  - Keeps character set conversion work out of the driver layer.
  - Reduces the amount of #ifdef work for multiple version support.
  - Recognizes that some drivers (SQLite) use a single encoding
    universally, while others allow different tables to use different
    encodings.
Cons:
  - Doesn't solve the "do()" problem of encoding to different
    charsets when inserting into tables of a driver which allows
    different charsets per table.
  - Doesn't provide an indicator which says "This came from a
    unicode string and was converted by ICU so is reliably in the
    correct encoding" versus "This was handed to me by the user as a
    binary string and may contain anything". Though this is also
    "fixable" by either changing the handler proto or by burying a
    state flag in the dbh/stmt objects.
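
The lookup order in (C) can be sketched as below. This assumes
trimmed-down, hypothetical versions of the dbh and stmt structures and
uses an encoding name in place of a real UConverter* so the example
builds without ICU; none of the sketch_ names exist in PDO.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ICU's UConverter. */
typedef struct {
    const char *encoding;
} sketch_conv;

/* Trimmed-down, hypothetical dbh and stmt structures. */
typedef struct {
    sketch_conv *encoding_conv;   /* set by the driver factory, may be NULL */
} sketch_dbh;

typedef struct {
    sketch_dbh  *dbh;             /* owning connection */
    sketch_conv *encoding_conv;   /* set during prepare, may be NULL */
} sketch_stmt;

/* Stand-in for the proposed pdo.default_encoding INI value. */
static sketch_conv sketch_default_conv = { "UTF-8" };

/* Connection-level lookup: driver converter, else the INI default. */
static const sketch_conv *sketch_dbh_conv(const sketch_dbh *dbh)
{
    assert(dbh != NULL);
    return dbh->encoding_conv ? dbh->encoding_conv : &sketch_default_conv;
}

/* Statement-level lookup: statement converter first, then fall through
 * to the connection-level rules. */
static const sketch_conv *sketch_stmt_conv(const sketch_stmt *stmt)
{
    assert(stmt != NULL);
    if (stmt->encoding_conv) {
        return stmt->encoding_conv;
    }
    return sketch_dbh_conv(stmt->dbh);
}
```

Note that a do() call only ever reaches sketch_dbh_conv(), which is why
per-table charsets remain a problem for it.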

Personally I like option C the best as it presents the least amount of
work for individual drivers, costs the least in terms of version/ifdefs,
and provides a reasonable degree of flexibility.

As mentioned, however, only B gives the driver information about the
reliability of the encoding: "Is this really UTF-8? Or am I going to
find a stray \xA0 in here somewhere?" Of course, we currently have no
such assurance. The user is simply expected to give the driver
well-formed data; if they don't, they're SOL already.
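
As an aside on the "stray \xA0" worry: without a type tag, the only way
a driver could tell is to scan the bytes. A minimal well-formedness
check might look like this (sketch_is_utf8 is a hypothetical helper,
written out by hand so the example has no ICU dependency):

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects overlong forms, surrogates and values above U+10FFFF. */
static int sketch_is_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;         /* continuation bytes expected */
        unsigned long cp;    /* code point being assembled */
        unsigned long min;   /* smallest value allowed at this length */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte (e.g. 0xA0) or bad lead */

        if (len - i - 1 < need) return 0;          /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0;     /* not a continuation */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return 0;
        }
        i += need + 1;
    }
    return 1;
}
```

A lone 0xA0 byte (a latin1 non-breaking space) fails the check, while
the two-byte sequence C2 A0 (the same character in UTF-8) passes. Passing
the check still doesn't prove the user meant UTF-8, only that the bytes
happen to be well-formed.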

I generally don't like A as it's the most wasteful and really doesn't
solve the difficult problems.

At any rate, share your thoughts.

-Sara

P.S. - Where is primary PDO development happening? Last I heard PECL
releases were coming out of the 5.1 branch and that was the place to be.
Has HEAD been kept in sync?

19 years ago by Edin Kadribasic

Sara Golemon wrote:

> P.S. - Where is primary PDO development happening? Last I heard PECL
> releases were coming out of the 5.1 branch and that was the place to be.
> Has HEAD been kept in sync?

No. Probably the best thing at this point would be to merge the whole of
PDO from 5_2 to HEAD.

Edin

19 years ago by Antony Dovgal

Sara Golemon wrote:

> > P.S. - Where is primary PDO development happening? Last I heard PECL
> > releases were coming out of the 5.1 branch and that was the place to be.
> > Has HEAD been kept in sync?
>
> No. Probably the best thing at this point would be to merge the whole of
> PDO from 5_2 to HEAD.

Yes, PDO's HEAD is out of sync.

--
Wbr,
Antony Dovgal

19 years ago by Ilia Alshanetsky

> (C) Add a UConverter *encoding_conv; element to pdo_dbh and
> pdo_stmt objects, and an INI setting: pdo.default_encoding. When
> passing data to/from a stmt object, the statement object's encoder
> is used if available (set during prepare), if not available the
> driver's converter is used (set by factory), otherwise
> pdo.default_encoding is used as a fallback. Data exchanges
> between the dbh object are similarly handled though (obviously)
> skipping the stmt step.
>
> Pros:
>   - Keeps character set conversion work out of the driver layer.
>   - Reduces the amount of #ifdef work for multiple version
>     support.
>   - Recognizes that some drivers (SQLite) use a single encoding
>     universally, while others allow different tables to use
>     different encodings.
> Cons:
>   - Doesn't solve the "do()" problem of encoding to different
>     charsets when inserting to tables of a driver which allows
>     different charsets per table.
>   - Doesn't provide an indicator which says "This came from a
>     unicode string and was converted by ICU so is reliably in the
>     correct encoding" versus "This was handed to me by the user as
>     a binary string and may contain anything". Though this is also
>     "fixable" by either changing the handler proto or by burying a
>     state flag in the dbh/stmt objects.

Of the options you propose, I think C is the most reasonable
solution, but I'd like to offer a few revisions.

PDO already has an API for setting attributes, setAttribute(). An
attribute can be set for a connection (as the default) and modified
on a per-statement basis via the same method. Attributes can also be
passed as parameters, which lets the user decide which charset to
send to the database. In some cases there is a neat cheat that can be
applied: set the connection charset to UTF-8 or even UTF-16 and let
the database (assuming it supports this) do the up/down conversion of
the data as needed.

Ilia Alshanetsky

19 years ago by Wez Furlong

Let's talk about this at the Zend conference.
I've only had time to skim your email, but what I had in mind is more
C than A or B on your list.

The PHP 5.2 branch is where it's at for PDO, as the unicode APIs were
changing too wildly to make it feasible for me to keep everything in
sync.

Given my lack of free time, I'm still in favour of finding a way to
avoid branching and merging PDO too much while we maintain the PHP 5
branch, as there are so many PDO extensions that it will become very
easy (for me at least) to forget a vital merge.

--Wez.

19 years ago by Sara Golemon

> Let's talk about this at the Zend conference.

I agree. The four of us will be in one place at one time; I just
wanted to start the topic in order to get thoughts from those who won't be
there or won't have time.

> I've only had time to skim your email, but what I had in mind is more
> C than A or B on your list.

Nod, I like this one best (of the options so far) too. Based on talking to
Ilia I realize the INI option is unnecessary. From what I understand, Ilia
would like to see the user required to specify the encoding via a driver
parameter, though I think in many cases we can let the driver have a
default which is overridden via driver param only if necessary.

> the PHP 5.2 branch is where it's at for PDO, as the unicode APIs were
> changing too wildly to make it feasible for me to keep everything in
> sync.

More than understood. Things have settled a bit in terms of the APIs, but
there are still a few tweaks going on.

> Given my lack of free time, I'm still in favour of finding a way to
> avoid branching and merging PDO too much while we maintain the PHP 5
> branch, as there are so many PDO extensions that it will become very
> easy (for me at least) to forget a vital merge.

I would think we'd make these changes as ifdef'd bits in whatever branch is
the primary PECL release branch (5.2). They'll be compartmentalized enough
that I think we can avoid making them hard to work around, and it would
allow Andrei's preview-release plans to move forward without forcing a new
branch on PDO maintenance.

Basically, we'd have the same situation we have now. Any release of HEAD
requires bulk copying 5.2's PDO code into HEAD's cvs tree (either via
backend hackery or megacommit to sync it up).
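
By "ifdef'd bits" I mean something along these lines. Treat it as a
sketch only: SKETCH_PDO_UNICODE and the sketch_ structure are made up
for the example, and PHP_MAJOR_VERSION is given a fallback definition
here just so the snippet compiles outside the PHP tree.

```c
#include <assert.h>
#include <stddef.h>

/* Normally provided by php_version.h; defined here only so the sketch
 * builds standalone. */
#ifndef PHP_MAJOR_VERSION
# define PHP_MAJOR_VERSION 5
#endif

/* Hypothetical switch for the unicode-aware code paths. */
#if PHP_MAJOR_VERSION >= 6
# define SKETCH_PDO_UNICODE 1
#else
# define SKETCH_PDO_UNICODE 0
#endif

/* Hypothetical, trimmed-down connection structure. The extra member
 * only exists in a unicode-enabled build. */
typedef struct {
    const char *dsn;
#if SKETCH_PDO_UNICODE
    void *encoding_conv;   /* would be a UConverter* in a real build */
#endif
} sketch_dbh_compat;

/* One accessor hides the version difference from the rest of the code. */
static int sketch_dbh_has_converter(const sketch_dbh_compat *dbh)
{
    assert(dbh != NULL);
#if SKETCH_PDO_UNICODE
    return dbh->encoding_conv != NULL;
#else
    return 0;
#endif
}
```

Keeping the conditionals inside a handful of accessors like this is
what should let one source tree serve both PHP 5 and HEAD.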

-Sara