Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:60721
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Date: Fri, 01 Jun 2012 23:29:08 +0200
To: Stas Malyshev <smalyshev@sugarcrm.com>, internals PHP
 <internals@lists.php.net>
Organization: Núcleo de Engenharia Biomédica do Instituto Superior Técnico
In-Reply-To: <4FC90A71.5090909@sugarcrm.com>
References: <ef6e6488aa70f76ccfe540098b54de83@nebm.ist.utl.pt>
 <4FC90A71.5090909@sugarcrm.com>
Message-ID: <83b8d9541f7b5ea2d6d9fd98aea03bb7@nebm.ist.utl.pt>
User-Agent: RoundCube Webmail/0.5.3
Subject: Re: [PHP-DEV] BreakIterator
From: glopes@nebm.ist.utl.pt (Gustavo Lopes)

On Fri, 01 Jun 2012 11:31:13 -0700, Stas Malyshev wrote:
>
>> BreakIterator also exposes other native methods:
>> getAvailableLocales(), getLocale() and factory methods to build
>> several predefined types of BreakIterators: createWordInstance()
>> for word boundaries, createCharacterInstance() for locale
>> dependent notions of "characters", createSentenceInstance() for
>> sentences, createLineInstance() and createTitleInstance() -- for
>> title casing breaks. These factories currently return
>
> One thing I notice here is that with this API it is not possible to
> programmatically choose what is the iteration unit - you'd have to do
> a switch for that. Do you think it may be a good idea to have a generic
> function that allows to choose the unit programmatically?

You can create a RuleBasedBreakIterator with any rules you choose. The 
rules are basically a set of regular expressions; ICU has two matching 
modes -- by default it tries the longest match, but it can also chain 
together rules. There are rules to advance, to go back and to go to a 
safe position from an arbitrary position in the two directions. The ICU 
user guide to which I linked in the first e-mail has more details.

> What is the notion of characters - is it grapheme characters? Is there
> option to iterate over code points too - not sure if it's useful just
> curious, as we used to have it in PHP 6 IIRC.

Yes, they are grapheme clusters. ICU has a special rule for Thai, but 
from what I see in the tracker, it's obsolete with recent versions of Unicode 
(possibly the root rule is now generic enough).

To iterate over code points, you can build a very simple 
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this 
example here: https://gist.github.com/2843005
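
As a rough sketch of the difference between grapheme clusters and code
points: the snippet below assumes the intl extension as it was later
released in PHP 5.5, where the classes are named IntlBreakIterator and
IntlRuleBasedBreakIterator rather than the names used in this thread.
The offsets it prints are UTF-8 byte offsets.

```php
<?php
// Assumption: released intl class names (PHP 5.5+), not the names in this thread.
// "a" followed by U+0301 (combining acute accent), then "b".
$text = "a\xCC\x81" . "b";

// Locale-aware "characters", i.e. grapheme clusters.
$graphemes = IntlBreakIterator::createCharacterInstance('en');
$graphemes->setText($text);

// The trivial rule '.;' matches any single code point.
$codePoints = new IntlRuleBasedBreakIterator('.;');
$codePoints->setText($text);

// Iterating a break iterator yields boundary offsets, not text.
$graphemeBreaks  = iterator_to_array($graphemes, false);
$codePointBreaks = iterator_to_array($codePoints, false);

printf("grapheme boundaries:   %s\n", implode(', ', $graphemeBreaks));
printf("code point boundaries: %s\n", implode(', ', $codePointBreaks));
```

The accented letter is one grapheme cluster made of two code points, so
the second list should contain one more boundary than the first.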

>
> About getAvailableLocales() - what this actually does? Does it list all
> available locales in the system, ones that have BreakIterator rules, or
> something else? If it's not related to BI, I'm not sure we need to have
> it in BI. What is the intended usage of it? Maybe it should be part of
> Locale class?

Right now, the ICU implementation just calls 
Locale::getAvailableLocales(), but its description is "Gets all the 
available locales that has localized text boundary data." so I suppose 
it could return a different set in the future.

>> Note that BreakIterator is an iterator only in the sense of the
>> first 'Iterator' in 'IteratorIterator', i.e., it does not
>> implement the Iterator interface. The reason is that there is
>> no sensible implementation for Iterator::key(). Using it for
>
> Doesn't it have a notion of current position? If so, key should be the
> current position.
>
> Will this BreakIterator be usable in foreach? I'm not sure I understand
> it from this description - understanding this without any usage
> examples, RFCs or code snippets for intended usage is really hard and I
> think we should really start with doing that. I would expect this class
> to work like this:
>
> foreach(BreakIterator::createWordInstance("blah blah blah") as $i =>
> $word) {
>    echo "Word number $i is $word\n";
> }
>
> or at least like this:
>
> foreach(BreakIterator::createWordInstance("blah blah blah") as $i =>
> $word) {
>    echo "Next word at position $i is: $word\n";
> }
>
> Is it the model? If not, I think we need to wrap the C API to make this
> possible, because this is what people expect in PHP from the iterator.

My choice here was to have BreakIterator mirror its ICU namesake: it 
iterates over breaks, i.e., boundaries in the text. Hence, the iterator 
returns the *positions* of the successive boundaries. Since the position 
is already the value, it cannot sensibly serve as the key as well.

Acknowledging that getting the text between the boundaries was going to 
be a common scenario, I added a method, getPartsIterator(), that yields 
the text between each boundary. Hence, there is one less element in this 
iterator than in the BreakIterator.
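
To make the relation between the two iterators concrete, here is a small
sketch. It assumes the class name IntlBreakIterator, which is what the
released intl extension (PHP 5.5+) uses instead of the name in this
thread; the sample text and the 'en' locale are arbitrary choices.

```php
<?php
// Assumption: released intl class name (PHP 5.5+), not the name in this thread.
$bi = IntlBreakIterator::createWordInstance('en');
$bi->setText('foo bar');

// Traversing the break iterator itself yields boundary positions.
$boundaries = iterator_to_array($bi, false);

// The parts iterator yields the text between consecutive boundaries.
$parts = iterator_to_array($bi->getPartsIterator(), false);

printf("%d boundaries: %s\n", count($boundaries), implode(', ', $boundaries));
printf("%d parts: %s\n", count($parts), json_encode($parts));
```

For 'foo bar' the boundaries are 0, 3, 4 and 7, and the parts are 'foo',
' ' and 'bar', i.e. one element fewer.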

Neither of the iterators implements getKey(), so when traversing either 
one the keys will be 0, 1, 2... It would probably be a good idea to 
change the parts iterator to give the left boundary as the key. That way 
one could do:

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
foreach ($bi->getPartsIterator() as $k => $v) {
     echo "$v is at position $k\n";
}

instead of

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
$pos = $bi->first();
foreach ($bi->getPartsIterator() as $v) {
     echo "$v is at position $pos\n";
     $pos = $bi->current();
}
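
A sketch of the left-boundary-as-key idea, assuming the intl extension as
later released in PHP 5.5: there the class is IntlBreakIterator, and
getPartsIterator() accepts a key type such as IntlPartsIterator::KEY_LEFT.
Neither name appears in this thread, and the sample text is arbitrary.

```php
<?php
// Assumption: released intl API (PHP 5.5+), not the names in this thread.
$bi = IntlBreakIterator::createWordInstance('en');
$bi->setText('foo bar');

$partsByPosition = [];
// KEY_LEFT asks for the left boundary of each part as the iteration key.
foreach ($bi->getPartsIterator(IntlPartsIterator::KEY_LEFT) as $pos => $part) {
    $partsByPosition[$pos] = $part;
    echo "'$part' is at position $pos\n";
}
```

The keys are byte offsets into the UTF-8 text, so they are not
consecutive the way the default 0, 1, 2... keys are.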

Another possibility would be to have the break iterator itself behave 
as the parts iterator for iteration purposes. I don't think that is a 
good idea. Even though BreakIterator does not implement Iterator, people 
would expect next() and current() to return the next and current iterator 
value, while they would be returning the iteration key.
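
For reference, this is roughly what manual traversal with first() and
next() looks like. It is a sketch assuming the released class name
IntlBreakIterator and its DONE constant (PHP 5.5+), neither of which is
named in this thread.

```php
<?php
// Assumption: released intl class name (PHP 5.5+), not the name in this thread.
$bi = IntlBreakIterator::createWordInstance('en');
$bi->setText('foo bar');

$positions = [];
// first() and next() return boundary offsets; next() returns DONE at the end.
for ($pos = $bi->first(); $pos !== IntlBreakIterator::DONE; $pos = $bi->next()) {
    $positions[] = $pos;
}

echo implode(', ', $positions), "\n";
```

The methods return positions in the text, never the text itself, which
is why making the break iterator yield parts in foreach would be
inconsistent with them.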

By the way, you can look at the test cases in the tree on github for 
examples: 
https://github.com/cataphract/php-src/commit/d289c3977ed4ba8d9ba127e5af9f709b19b8e1ba

Thanks for the comments!

-- 
Gustavo Lopes