Newsgroups: php.internals
Xref: news.php.net php.internals:60738
From: glopes@nebm.ist.utl.pt ("Gustavo Lopes")
To: "internals@lists.php.net", "Stas Malyshev"
Date: Mon, 04 Jun 2012 23:08:39 +0200
Organization: Núcleo de Eng. Biomédica do I.S.T.
In-Reply-To: <4FCD07E8.5050509@sugarcrm.com>
Subject: Re: [PHP-DEV] BreakIterator

On Mon, 04 Jun 2012 21:09:28 +0200, Stas Malyshev wrote:

> I understand that, but I have no idea how to write proper rules for word
> boundaries, I just want to tell it "give me word boundaries" but not by
> saying createWordBoundaries() but by doing createIterator($type) where
> $type == WORD_BOUNDARIES.

Why? This makes no sense to me. Why would createIterator(WORD_BOUNDARIES)
be better than BreakIterator::createWordInstance()? Especially in a dynamic
language like PHP where you can do:

    $type = 'word';
    $bi = BreakIterator::{"create" . $type . 'instance'}(NULL);

>> To iterate over code points, you can build a very simple
>> RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
>> example here: https://gist.github.com/2843005
>
> Is there any reason not to provide this as a service for PHP user? I
> understand somebody who is a specialist in ICU knows that already, but
> most PHP users don't know this magic.

Well, the reason I didn't add it is because ICU didn't add such an
iterator. I imagine the reason for that is that there are much more
efficient ways to iterate over UTF-8 that don't involve a full-blown regex
based text segmentation engine. In fact, ICU provides very efficient ways
(with macros and simple specialized functions) to iterate over UTF-8 text
in utf8.h:

http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/common/unicode/utf8.h

>
>> Right now, the ICU implementation just calls
>> Locale::getAvailableLocales(), but its description is "Gets all the
>> available locales that has localized text boundary data."
>> so I suppose it could return a different set in the future.
>
> My only concern is that no other classes have getAvailableLocales() and
> it doesn't seem to do anything useful now, so maybe we should omit it
> for now?

I have no special love for it, but your statement is inaccurate in one
respect -- I've added a similar function in IntlCalendar, whose
implementation is basically the same:

http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/i18n/calendar.cpp#getAvailableLocales

I don't mind removing both though.

> Another thing I notice here: why not make:
> $bi = BreakIterator::createWordInstance(NULL);
> $bi->setText($foo);
>
> into:
> $bi = BreakIterator::createWordInstance(NULL, $foo);
>

Two reasons:

* it encourages bad behavior, namely not reusing the BreakIterator objects.
* that's not the ICU signature. If ICU in the future adds overloads with a
  string in the second argument, we'll find ourselves with odd signatures.

> OK, if you have to do getPartsIterator() it's fine as long as you can
> easily do foreach on it, since that's what one expects from iterator.
> I'd also add some flag that would skip or not skip whitespace, if this
> is possible - so for "foo bar" sometimes you want ['foo', ' ', 'bar']
> and sometimes you want just ['foo', 'bar'] - does ICU support it somehow?

The BreakIterator cannot throw away text. You have to look at the rule
statuses. Example:

    $text = 'This is a phrase... with some punctuation.';
    $bi = BreakIterator::createWordInstance(NULL);
    $bi->setText($text);
    foreach ($bi->getPartsIterator() as $v) {
        // Statuses below WORD_NONE_LIMIT mark spaces and punctuation;
        // >= rather than > so that numbers (status == WORD_NONE_LIMIT)
        // are kept as well.
        if ($bi->getRuleStatus() >= BreakIterator::WORD_NONE_LIMIT)
            var_dump($v);
    }

    string(4) "This"
    string(2) "is"
    string(1) "a"
    string(6) "phrase"
    string(4) "with"
    string(4) "some"
    string(11) "punctuation"

> Again, having some full description of proposed API would be nice.
> For example, what hashCode() does?

The ICU docs only say "Compute a hash code for this BreakIterator."
If I'm not mistaken from my quick glance at the source, it just returns the
length of the forward rules.

--
Gustavo Lopes
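
A footnote to the utf8.h remark above: the lead-byte approach can be sketched in userland PHP to show how little machinery code point iteration needs. This is only an illustration under stated assumptions: utf8CodePoints() is a hypothetical helper (it is not part of intl or ICU), it needs PHP 7 for the type declarations and \u{...} escapes, and it assumes the input is already well-formed UTF-8.

```php
<?php
// Hypothetical helper, not part of intl or ICU.
// Yields each code point of a well-formed UTF-8 string as its own substring.
// Assumption: the input is valid UTF-8; malformed input is not detected.
function utf8CodePoints(string $s): Generator
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $lead = ord($s[$i]);
        // The lead byte alone determines the length of the sequence.
        if ($lead < 0x80) {
            $n = 1;        // 0xxxxxxx: ASCII
        } elseif ($lead < 0xE0) {
            $n = 2;        // 110xxxxx
        } elseif ($lead < 0xF0) {
            $n = 3;        // 1110xxxx
        } else {
            $n = 4;        // 11110xxx
        }
        yield substr($s, $i, $n);
        $i += $n;
    }
}

// Usage: print the bytes of each code point in hex, one per line.
foreach (utf8CodePoints("na\u{EF}ve \u{20AC}") as $cp) {
    echo bin2hex($cp), "\n";
}
```

This only splits at code point boundaries; for words, sentences or grapheme clusters a break iterator is still the right tool.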