Newsgroups: php.internals
Message-ID: <4FCD07E8.5050509@sugarcrm.com>
Date: Mon, 04 Jun 2012 12:09:28 -0700
From: smalyshev@sugarcrm.com (Stas Malyshev)
Organization: SugarCRM
To: Gustavo Lopes
CC: internals PHP
Subject: Re: [PHP-DEV] BreakIterator
References: <4FC90A71.5090909@sugarcrm.com> <83b8d9541f7b5ea2d6d9fd98aea03bb7@nebm.ist.utl.pt>
In-Reply-To: <83b8d9541f7b5ea2d6d9fd98aea03bb7@nebm.ist.utl.pt>

Hi!

> You can create a RuleBasedBreakIterator with any rules you choose. The

I understand that, but I have no idea how to write proper rules for word
boundaries. I just want to tell it "give me word boundaries", not by
calling createWordBoundaries() but by calling createIterator($type)
where $type == WORD_BOUNDARIES.

> To iterate over code points, you can build a very simple
> RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
> example here: https://gist.github.com/2843005

Is there any reason not to provide this as a service for PHP users? I
understand that somebody who is a specialist in ICU knows this already,
but most PHP users don't know this magic.

> Right now, the ICU implementation just calls
> Locale::getAvailableLocales(), but its description is "Gets all the
> available locales that has localized text boundary data." so I suppose
> it could return a different set in the future.

My only concern is that no other classes have getAvailableLocales() and
it doesn't seem to do anything useful now, so maybe we should omit it
for now?

> Acknowledging that getting the text between the boundaries was going to
> be a common scenario, I added a method, getPartsIterator(), that yields
> the text between each boundary. Hence, there is one less element in this
> iterator than in the BreakIterator.
>
> Neither of the iterators implement getKey(), so one traversing the keys
> will be 0, 1, 2... It would probably be a good idea to change the
> parts iterator to give the left boundary as the key.
> That way one could do:
>
> $bi = BreakIterator::createWordInstance(NULL);
> $bi->setText($foo);
> foreach ($bi->getPartsIterator() as $k => $v) {
>     echo "$v is at position $k\n";
> }

Another thing I notice here: why not make:

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);

into:

$bi = BreakIterator::createWordInstance(NULL, $foo);

This means less boilerplate code, since if you are creating an iterator,
chances are you already have some string to iterate over.

> Another possibility would be to have the break iterator itself behave
> as the parts iterator for iteration purposes. I don't think that is a
> good idea. Even though BreakIterator does not implement Iterator, people
> would expect next() and current() return the next and current iterator
> value, while they would be returning the iteration key.

OK, if you have to call getPartsIterator(), that's fine as long as you
can easily do foreach on it, since that's what one expects from an
iterator. I'd also add a flag that controls whether whitespace is
skipped, if this is possible: for "foo bar" sometimes you want
['foo', ' ', 'bar'] and sometimes you want just ['foo', 'bar']. Does ICU
support this somehow?

Again, having a full description of the proposed API would be nice. For
example, what does hashCode() do?

-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227
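
For the "foo bar" whitespace question above, here is a rough sketch of how such a flag might be layered on top of the parts iterator. It assumes the class as it later shipped in PHP's intl extension under the name IntlBreakIterator (PHP 5.5+), not the BreakIterator name used in this thread. The helper splitParts() and its $skipNonWords flag are hypothetical names made up for illustration; the filtering relies on getRuleStatus() reporting IntlBreakIterator::WORD_NONE for segments such as spaces and punctuation.

```php
<?php
// Sketch only: requires the intl extension (PHP 5.5+).
// splitParts() and $skipNonWords are hypothetical, not part of intl.
function splitParts($text, $skipNonWords = false, $locale = 'en_US')
{
    $bi = IntlBreakIterator::createWordInstance($locale);
    $bi->setText($text);

    $parts = array();
    foreach ($bi->getPartsIterator() as $part) {
        // After each step the break iterator sits on the right boundary of
        // $part, so getRuleStatus() describes the segment that just ended.
        // WORD_NONE marks whitespace and punctuation segments.
        if ($skipNonWords
            && $bi->getRuleStatus() === IntlBreakIterator::WORD_NONE) {
            continue;
        }
        $parts[] = $part;
    }
    return $parts;
}

$all   = splitParts('foo bar');        // every segment, spaces included
$words = splitParts('foo bar', true);  // word-like segments only
```

Keeping the filter outside the iterator means the iterator itself still reports every boundary, and the caller decides which segments to drop.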