Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:60738
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain nebm.ist.utl.pt from 193.136.128.21 cause and error)
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
To: "internals@lists.php.net" <internals@lists.php.net>, "Stas Malyshev"
 <smalyshev@sugarcrm.com>
References: <ef6e6488aa70f76ccfe540098b54de83@nebm.ist.utl.pt>
 <4FC90A71.5090909@sugarcrm.com>
 <83b8d9541f7b5ea2d6d9fd98aea03bb7@nebm.ist.utl.pt>
 <4FCD07E8.5050509@sugarcrm.com>
Date: Mon, 04 Jun 2012 23:08:39 +0200
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Organization: =?utf-8?Q?N=C3=BAcleo_de_Eng=2E_Biom=C3=A9di?=
 =?utf-8?Q?ca_do_I=2ES=2ET=2E?=
Message-ID: <op.wfec0pqqidpuyk@damnation.nl.lo.geleia.net>
In-Reply-To: <4FCD07E8.5050509@sugarcrm.com>
User-Agent: Opera Mail/11.64 (Win32)
Subject: Re: [PHP-DEV] BreakIterator
From: glopes@nebm.ist.utl.pt ("Gustavo Lopes")

On Mon, 04 Jun 2012 21:09:28 +0200, Stas Malyshev <smalyshev@sugarcrm.com>  
wrote:

> I understand that, but I have no idea how to write proper rules for word
> boundaries, I just want to tell it "give me word boundaries" but not by
> saying createWordBoundaries() but by doing createIterator($type) where
> $type == WORD_BOUNDARIES.

Why? This makes no sense to me. Why would createIterator(WORD_BOUNDARIES)  
be better than BreakIterator::createWordInstance()? Especially in a  
dynamic language like PHP where you can do:

$type = 'word';
$bi = BreakIterator::{"create" . $type . 'instance'}(NULL);

>> To iterate over code points, you can build a very simple
>> RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
>> example here: https://gist.github.com/2843005
>
> Is there any reason not to provide this as a service for PHP user? I
> understand somebody who is a specialist in ICU knows that already, but
> most PHP users don't know this magic.

Well, I didn't add it because ICU doesn't provide such an iterator. I
imagine that's because there are much more efficient ways to iterate over
UTF-8 that don't involve a full-blown regex-based text segmentation
engine. In fact, ICU provides very efficient ways (macros and simple
specialized functions) to iterate over UTF-8 text in utf8.h:

http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/common/unicode/utf8.h
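
For illustration only, here is a userland sketch of iterating over code
points without any break iterator. It does not use the utf8.h macros (those
are C); it relies on PCRE's /u modifier instead, and codePoints() is a name
made up for this example, not part of the proposed API:

```php
<?php
// Hypothetical helper, not part of the proposed API: splits a UTF-8
// string into code points using PCRE in UTF-8 mode.
function codePoints($text)
{
    // An empty pattern with /u matches between every code point;
    // PREG_SPLIT_NO_EMPTY drops the empty leading and trailing pieces.
    $parts = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false when $text is not valid UTF-8.
    return $parts === false ? array() : $parts;
}

foreach (codePoints("a\xC3\xA9\xE2\x82\xAC") as $cp) {
    var_dump($cp); // one UTF-8 encoded code point per iteration
}
```

Note that this yields code points, not user-perceived characters; a base
letter followed by a combining mark still comes out as two elements.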

>
>> Right now, the ICU implementation just calls
>> Locale::getAvailableLocales(), but its description is "Gets all the
>> available locales that has localized text boundary data." so I suppose
>> it could return a different set in the future.
>
> My only concern is that no other classes have getAvailableLocales() and
> it doesn't seem to do anything useful now, so maybe we should omit it
> for now?

I have no special love for it, but your statement is inaccurate in one
aspect -- I've added a similar function in IntlCalendar... whose
implementation is basically the same:

http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/i18n/calendar.cpp#getAvailableLocales

I don't mind removing both though.

> Another thing I notice here: why not make:
> $bi = BreakIterator::createWordInstance(NULL);
> $bi->setText($foo);
>
> into:
> $bi = BreakIterator::createWordInstance(NULL, $foo);
>

Two reasons:

* it encourages bad behavior, namely not reusing the BreakIterator objects.
* that's not the ICU signature. If ICU in the future adds overloads with a  
string in the second argument, we'll find ourselves with odd signatures.
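
To make the first point concrete, here is a sketch of the reuse pattern,
assuming the API as currently proposed in this thread. splitAll() is a
hypothetical helper written only for this example:

```php
<?php
// Hypothetical helper: segments several texts with a single iterator.
// $bi is expected to be a BreakIterator as proposed in this thread,
// e.g. the result of BreakIterator::createWordInstance(NULL).
function splitAll($bi, array $texts)
{
    $result = array();
    foreach ($texts as $text) {
        // Point the existing iterator at the next text instead of
        // creating a new iterator for every string.
        $bi->setText($text);
        $parts = array();
        foreach ($bi->getPartsIterator() as $part) {
            $parts[] = $part;
        }
        $result[] = $parts;
    }
    return $result;
}
```

It would be called as splitAll(BreakIterator::createWordInstance(NULL),
array('foo bar', 'baz')), so the iterator is created once no matter how
many strings are processed.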

> OK, if you have to do getPartsIterator() it's fine as long as you can
> easily do foreach on it, since that's what one expects from iterator.
> I'd also add some flag that would skip or not skip whitespace, if this
> is possible - so for "foo bar" sometimes you want ['foo', ' ', 'bar']
> and sometimes you want just ['foo', 'bar'] - does ICU support it somehow?

The BreakIterator cannot throw away text. You have to look at the rule
status of each part; values below WORD_NONE_LIMIT mark whitespace and
punctuation. Example:

$text = 'This is a phrase... with some punctuation.';
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($text);
foreach ($bi->getPartsIterator() as $v) {
	// numbers have a status equal to WORD_NONE_LIMIT, so use >= to keep them
	if ($bi->getRuleStatus() >= BreakIterator::WORD_NONE_LIMIT)
		var_dump($v);
}

string(4) "This"
string(2) "is"
string(1) "a"
string(6) "phrase"
string(4) "with"
string(4) "some"
string(11) "punctuation"

> Again, having some full description of proposed API would be nice.
> For example, what hashCode() does?

The ICU docs only say "Compute a hash code for this BreakIterator." If I'm  
not mistaken from my quick glance at the source, it just returns the  
length of the forward rules.

-- 
Gustavo Lopes