Newsgroups: php.internals
From: glopes@nebm.ist.utl.pt (Gustavo Lopes)
To: Stas Malyshev, internals PHP
Date: Fri, 01 Jun 2012 23:29:08 +0200
Subject: Re: [PHP-DEV] BreakIterator
Organization: Núcleo de Engenharia Biomédica do Instituto Superior Técnico
Message-ID: <83b8d9541f7b5ea2d6d9fd98aea03bb7@nebm.ist.utl.pt>
In-Reply-To: <4FC90A71.5090909@sugarcrm.com>

On Fri, 01 Jun 2012 11:31:13 -0700, Stas Malyshev wrote:
>
>> BreakIterator also exposes other native methods:
>> getAvailableLocales(), getLocale() and factory methods to build
>> several predefined types of BreakIterators: createWordInstance()
>> for word boundaries, createCharacterInstance() for locale
>> dependent notions of "characters", createSentenceInstance() for
>> sentences, createLineInstance() and createTitleInstance() -- for
>> title casing breaks. These factories currently return
>
> One thing I notice here is that with this API it is not possible to
> programmatically choose what is the iteration unit - you'd have to do a
> switch for that. Do you think it may be a good idea to have a generic
> function that allows to choose the unit programmatically?

You can create a RuleBasedBreakIterator with any rules you choose. The
rules are basically a set of regular expressions; ICU has two matching
modes -- by default it tries the longest match, but it can also chain
rules together. There are separate rules to advance, to go back, and to
move to a safe position from an arbitrary position in either direction.
The ICU user guide that I linked in the first e-mail has more details.

> What is the notion of characters - is it grapheme characters? Is there
> option to iterate over code points too - not sure if it's useful just
> curious, as we used to have it in PHP 6 IIRC.

Yes, they are grapheme clusters. ICU has a special rule for Thai, but
from what I see in the tracker, it is obsolete with recent versions of
Unicode (possibly the root rule is now generic enough).

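To make the grapheme-cluster answer concrete, here is a minimal sketch. It assumes the intl extension is loaded and that the class is available as IntlBreakIterator, the name it carries in released PHP versions (this thread calls it BreakIterator). The sample string and the 'en' locale are arbitrary choices for the example, and the "\u{...}" escape needs PHP 7 or later.

```php
<?php
// Sketch only: assumes ext/intl and the released class name IntlBreakIterator
// (called BreakIterator in this thread).

// "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a":
// three code points, but only two user-perceived characters.
$text = "e\u{0301}a";

$bi = IntlBreakIterator::createCharacterInstance('en');
$bi->setText($text);

// Collect the text between consecutive boundaries, i.e. the grapheme clusters.
// The second argument discards the iterator keys and reindexes from 0.
$clusters = iterator_to_array($bi->getPartsIterator(), false);

printf(
    "%d code points, %d grapheme clusters\n",
    preg_match_all('/./su', $text), // counts code points via PCRE in UTF-8 mode
    count($clusters)
);
```

The combining accent stays attached to its base letter, so the character iterator reports fewer units than a code point count would.
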
To iterate over code points, you can build a very simple
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
example here: https://gist.github.com/2843005

>
> About getAvailableLocales() - what this actually does? Does it list all
> available locales in the system, ones that have BreakIterator rules, or
> something else? If it's not related to BI, I'm not sure we need to have
> it in BI. What is the intended usage of it? Maybe it should be part of
> Locale class?

Right now, the ICU implementation just calls
Locale::getAvailableLocales(), but its description is "Gets all the
available locales that has localized text boundary data." so I suppose
it could return a different set in the future.

>> Note that BreakIterator is an iterator only in the sense of the
>> first 'Iterator' in 'IteratorIterator', i.e., it does not
>> implement the Iterator interface. The reason is that there is
>> no sensible implementation for Iterator::key(). Using it for
>
> Doesn't it have a notion of current position? If so, key should be the
> current position.
>
> Will this BreakIterator be usable in foreach? I'm not sure I understand
> it from this description - understanding this without any usage
> examples, RFCs or code snippets for intended usage is really hard and I
> think we should really start with doing that. I would expect this class
> to work like this:
>
> foreach(BreakIterator::createWordInstance("blah blah blah") as $i =>
> $word) {
>     echo "Word number $i is $word\n";
> }
>
> or at least like this:
>
> foreach(BreakIterator::createWordInstance("blah blah blah") as $i =>
> $word) {
>     echo "Next word at position $i is: $word\n";
> }
>
> Is it the model? If not, I think we need to wrap the C API to make this
> possible, because this is what people expect in PHP from the iterator.

My reasoning here was as follows: BreakIterator mirrors its ICU
namesake -- it iterates over breaks, i.e., boundaries in the text.
Hence, the iterator returns the *positions* of the successive
boundaries. Therefore, the position cannot also be used as the key.

Acknowledging that getting the text between the boundaries was going to
be a common scenario, I added a method, getPartsIterator(), that yields
the text between each pair of boundaries. Hence, there is one less
element in this iterator than in the BreakIterator.

Neither of the iterators implements getKey(), so when traversing them
the keys will be 0, 1, 2... It would probably be a good idea to change
the parts iterator to give the left boundary as the key. That way one
could do:

    $bi = BreakIterator::createWordInstance(NULL);
    $bi->setText($foo);
    foreach ($bi->getPartsIterator() as $k => $v) {
        echo "$v is at position $k\n";
    }

instead of

    $bi = BreakIterator::createWordInstance(NULL);
    $bi->setText($foo);
    $pos = $bi->first();
    foreach ($bi->getPartsIterator() as $v) {
        echo "$v is at position $pos\n";
        $pos = $bi->current();
    }

Another possibility would be to have the break iterator itself behave
as the parts iterator for iteration purposes. I don't think that is a
good idea. Even though BreakIterator does not implement Iterator, people
would expect next() and current() to return the next and current
iterator value, while they would be returning the iteration key.

By the way, you can look at the test cases in the tree on github for
examples:
https://github.com/cataphract/php-src/commit/d289c3977ed4ba8d9ba127e5af9f709b19b8e1ba

Thanks for the comments!

-- 
Gustavo Lopes
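
The relationship between boundaries and parts described in this message can be sketched as below. This is an illustration, not the code from the branch. It assumes the intl extension as released, where the class is named IntlBreakIterator and getPartsIterator() accepts IntlPartsIterator::KEY_LEFT to use the left boundary as the key. Neither of those names appears in this thread, and the sample text and 'en' locale are arbitrary.

```php
<?php
// Sketch only: assumes ext/intl with the released names IntlBreakIterator
// and IntlPartsIterator (this thread uses BreakIterator).

$text = "blah blah blah";

$bi = IntlBreakIterator::createWordInstance('en');
$bi->setText($text);

// 1) Walk the boundaries themselves: each value is a position in the text.
$boundaries = [];
for ($pos = $bi->first(); $pos !== IntlBreakIterator::DONE; $pos = $bi->next()) {
    $boundaries[] = $pos;
}

// 2) Walk the parts, keyed by the left boundary of each part.
$parts = [];
foreach ($bi->getPartsIterator(IntlPartsIterator::KEY_LEFT) as $left => $part) {
    $parts[$left] = $part;
    echo "'$part' is at position $left\n";
}

// N boundaries delimit N - 1 parts.
printf("%d boundaries, %d parts\n", count($boundaries), count($parts));
```

Note that the word iterator also reports the runs of whitespace between words as parts, so a caller interested only in words would have to filter them out.
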