Newsgroups: php.internals
Xref: news.php.net php.internals:60698
Message-ID: <4FC80163.3030500@gmail.com>
Date: Fri, 01 Jun 2012 09:40:19 +1000
From: davidkmuir@gmail.com (David Muir)
To: Gustavo Lopes
CC: internals@lists.php.net
Subject: Re: [PHP-DEV] BreakIterator

Coming from a "pleb", my only concern is the name if the class is in the
global scope. To me, a "BreakIterator" sounds like something related to
breaking out of a looping structure, not something used for iterating
over various language structure boundaries. If it's in an ICU namespace,
then it's not a problem, as it's clearly related to Unicode.

Cheers,
David

On 31/05/12 21:21, Gustavo Lopes wrote:
> Hi
>
> I've wrapped ICU's BreakIterator and RuleBasedBreakIterator. I stopped
> short of adding a procedural interface. I think there's a larger
> expectation of having an OOP interface when working with iterators.
> What do you think? If there's no procedural interface, I'll change the
> instances of zend_parse_methods to zpp for performance.
>
> Now I'll copy the commit message here in case someone wants to comment
> on a specific point inline:
>
> ----
> BreakIterator and RuleBasedBreakIterator added
> This commit adds wrappers for the classes BreakIterator and
> RuleBasedBreakIterator. The C++ ICU classes are described here:
>
>
>
> Additionally, a tutorial is available at:
>
>
> This implementation wraps UTF-8 text in a UText. The text is
> iterated without any copying or conversion to UTF-16. There is
> also no validation that the input is actually UTF-8; where there
> are malformed sequences, the UText will simply yield U+FFFD.
>
> The class BreakIterator cannot be instantiated directly (has a
> private constructor).
> It provides the interface exposed by the ICU
> abstract class with the same name. The PHP class is not abstract
> because we may use it to wrap native subclasses of BreakIterator
> that we don't know how to wrap. This class includes methods to
> move the iterator position to the beginning (first()), to the
> end (last()), forward (next()), backwards (previous()), to the
> boundary preceding a certain position (preceding()) and following
> a certain position (following()) and to obtain the current position
> (current()). next() can also be used to advance or recede an
> arbitrary number of positions.
>
> BreakIterator also exposes other native methods:
> getAvailableLocales(), getLocale() and factory methods to build
> several predefined types of BreakIterators: createWordInstance()
> for word boundaries, createCharacterInstance() for locale
> dependent notions of "characters", createSentenceInstance() for
> sentences, createLineInstance() and createTitleInstance() -- for
> title casing breaks. These factories currently return
> RuleBasedBreakIterators where the names of the rule sets are found
> in the ICU data, observing the passed locale (although the locale
> is taken into consideration, there are very few exceptions to the
> root rules).
>
> The clone and compare_object PHP object handlers are also
> implemented, though the comparison does not yield meaningful results
> when used with >, <, >= and <=.
>
> Note that BreakIterator is an iterator only in the sense of the
> first 'Iterator' in 'IteratorIterator', i.e., it does not
> implement the Iterator interface. The reason is that there is
> no sensible implementation for Iterator::key(). Using it for
> an ordinal of the current boundary is not feasible because
> we are allowed to move to any boundary at any time. If we were
> to determine the current ordinal when last() is called we'd
> have to traverse the whole input text to find out how many
> breaks there were before.
> Therefore, BreakIterator implements
> only Traversable. It can be wrapped in an IteratorIterator,
> but the usual warnings apply.
>
> Finally, I added a convenience method to BreakIterator:
> getPartsIterator(). This provides an IntlIterator, backed
> by the BreakIterator PHP object (i.e. moving the pointer or
> changing the text in BreakIterator affects the iterator
> and also moving the iterator affects the backing BreakIterator),
> which allows traversing the text between each boundary.
> This iterator uses the original text to retrieve the text
> between two positions, not the code points returned by the
> wrapping UText. Therefore, if the text includes invalid code
> unit sequences, these invalid sequences will be in the output
> of this iterator, not U+FFFD code points.
>
> The class RuleBasedBreakIterator exposes a constructor that allows
> building an iterator from arbitrary compiled or non-compiled
> rules. The form of these rules is described in the tutorial linked
> above. The rest of the methods allow retrieving the rules --
> getRules() and getCompiledRules() --, a hash code of the rule set
> (hashCode()) and the rule statuses (getRuleStatus() and
> getRuleStatusVec()).
>
> Because the RuleBasedBreakIterator constructor may return parse
> errors, I reuse the UParseError to text function that was in the
> transliterator files. Therefore, I moved that function to
> intl_error.c.
>
> common_enum.cpp was also changed, mainly to expose previously
> static functions. This avoided code duplication when implementing
> the BreakIterator iterator and the IntlIterator returned by
> BreakIterator::getPartsIterator().
>
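
To illustrate the getPartsIterator() point in the quoted commit message (parts are cut from the original text, so invalid sequences survive instead of becoming U+FFFD), here is a rough Python sketch. It is not the PHP or ICU implementation: toy_boundaries() and parts() are hypothetical names, and the boundary rule (a break wherever a run of spaces starts or ends) is only an assumed stand-in for ICU's real rule sets.

```python
from typing import Iterator, List


def toy_boundaries(data: bytes) -> List[int]:
    """Hypothetical stand-in for a break iterator.

    Returns byte offsets of the start of the text, every point where a
    run of ASCII spaces starts or ends, and the end of the text.
    """
    offsets = [0]
    for i in range(1, len(data)):
        # A boundary sits wherever the "is a space" property flips.
        if (data[i] == 0x20) != (data[i - 1] == 0x20):
            offsets.append(i)
    if len(data) > 0:
        offsets.append(len(data))
    return offsets


def parts(data: bytes) -> Iterator[bytes]:
    """Yield the original bytes between each pair of consecutive boundaries."""
    offsets = toy_boundaries(data)
    for start, end in zip(offsets, offsets[1:]):
        yield data[start:end]


text = b"foo \xff bar"  # \xff is not a valid UTF-8 sequence
pieces = list(parts(text))

# A decoding view of the same text substitutes U+FFFD for the bad byte,
# which is roughly what the commit message says the UText view does.
decoded = text.decode("utf-8", errors="replace")
```

Because the parts are slices of the original bytes, joining them gives back the input unchanged, whereas the decoded view has lost the invalid byte.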