Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:60698
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.160.42 as permitted sender)
Message-ID: <4FC80163.3030500@gmail.com>
Date: Fri, 01 Jun 2012 09:40:19 +1000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1
MIME-Version: 1.0
To: Gustavo Lopes <glopes@nebm.ist.utl.pt>
CC: internals@lists.php.net
References: <ef6e6488aa70f76ccfe540098b54de83@nebm.ist.utl.pt>
In-Reply-To: <ef6e6488aa70f76ccfe540098b54de83@nebm.ist.utl.pt>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] BreakIterator
From: davidkmuir@gmail.com (David Muir)

Coming from a "pleb", my only concern is the name if the class is in the
global scope. A "BreakIterator" to me sounds like something related to
breaking out of a looping structure, and not something used for
iterating over various language structure boundaries.
If it's in an ICU namespace, then it's not a problem, as it's clearly
related to Unicode.

Cheers,
David

On 31/05/12 21:21, Gustavo Lopes wrote:
> Hi
>
> I've wrapped ICU's BreakIterator and RuleBasedBreakIterator. I stopped
> short of adding a procedural interface. I think there's a larger
> expectation of having an OOP interface when working with iterators.
> What do you think? If there's no procedural interface, I'll change the
> calls to zend_parse_method_parameters() to plain
> zend_parse_parameters() (zpp) for performance.
>
> Now I'll copy the commit message here in case someone wants to comment
> on a specific point inline:
>
> ----
> BreakIterator and RuleBasedBreakIterator added
> This commit adds wrappers for the classes BreakIterator and
> RuleBasedBreakIterator. The C++ ICU classes are described here:
> <http://icu-project.org/apiref/icu4c/classBreakIterator.html>
> <http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html>
>
> Additionally, a tutorial is available at:
> <http://userguide.icu-project.org/boundaryanalysis>
>
> This implementation wraps UTF-8 text in a UText. The text is
> iterated without any copying or conversion to UTF-16. There is
> also no validation that the input is actually UTF-8; where there
> are malformed sequences, the UText will simply return U+FFFD.
>
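To make the U+FFFD point concrete, here is a minimal sketch in Python
(not the PHP or ICU API). It assumes Python's "replace" error handler as
a stand-in for what the UText does on malformed input; how many U+FFFD
come back per malformed sequence may well differ in ICU. The sample
bytes are made up.

```python
# Stand-in for the behaviour described above: malformed UTF-8 is not
# rejected, it simply reads back as U+FFFD. This uses Python's decoder,
# NOT ICU's UText.
raw = b"caf\xc3\xa9 \xff ok"   # \xff can never appear in valid UTF-8

decoded = raw.decode("utf-8", errors="replace")

print(repr(decoded))           # the lone \xff is seen as U+FFFD
print(raw)                     # the original bytes are left untouched
```

The thing to notice is that nothing is copied back or repaired: the
replacement only exists in the code points seen while iterating.
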
> The class BreakIterator cannot be instantiated directly (has a
> private constructor). It provides the interface exposed by the ICU
> abstract class with the same name. The PHP class is not abstract
> because we may use it to wrap native subclasses of BreakIterator
> that we don't know how to wrap. This class includes methods to
> move the iterator position to the beginning (first()), to the
> end (last()), forward (next()), backwards (previous()), to the
> boundary preceding a certain position (preceding()) and following
> a certain position (following()) and to obtain the current position
> (current()). next() can also be used to advance or recede an
> arbitrary number of positions.
>
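For anyone unfamiliar with the ICU-style navigation methods, the
following is a toy model in Python, assuming the boundaries are already
known as a sorted list of offsets. ToyBreakIterator and its boundary
list are hypothetical; the real classes compute boundaries from rules.
DONE = -1 matches ICU's BreakIterator::DONE, but where the position ends
up after a DONE result is simplified here (it stays unchanged).

```python
from bisect import bisect_left, bisect_right

DONE = -1  # same value as ICU's BreakIterator::DONE


class ToyBreakIterator:
    """Toy model: navigates a precomputed, non-empty list of offsets."""

    def __init__(self, boundaries):
        self._b = sorted(boundaries)
        self._i = 0

    def current(self):
        return self._b[self._i]

    def first(self):
        self._i = 0
        return self.current()

    def last(self):
        self._i = len(self._b) - 1
        return self.current()

    def next(self, n=1):
        # n may be negative to recede; out-of-range moves return DONE
        # and (a simplification) leave the position where it was.
        j = self._i + n
        if j < 0 or j >= len(self._b):
            return DONE
        self._i = j
        return self.current()

    def previous(self):
        return self.next(-1)

    def following(self, offset):
        # First boundary strictly after the given offset.
        j = bisect_right(self._b, offset)
        if j >= len(self._b):
            return DONE
        self._i = j
        return self.current()

    def preceding(self, offset):
        # Last boundary strictly before the given offset.
        j = bisect_left(self._b, offset) - 1
        if j < 0:
            return DONE
        self._i = j
        return self.current()


it = ToyBreakIterator([0, 5, 6, 11])  # e.g. word breaks in "hello world"
print(it.first(), it.next(), it.next(2), it.next())
```

Every method returns an offset into the text (or DONE), never the text
itself.
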
> BreakIterator also exposes other native methods:
> getAvailableLocales(), getLocale() and factory methods to build
> several predefined types of BreakIterators: createWordInstance()
> for word boundaries, createCharacterInstance() for locale
> dependent notions of "characters", createSentenceInstance() for
> sentences, createLineInstance() for line breaks and
> createTitleInstance() for title casing breaks. These factories
> currently return RuleBasedBreakIterators where the names of the rule
> sets are found in the ICU data, observing the passed locale (although
> the locale is taken into consideration, there are very few
> exceptions to the root rules).
>
> The clone and compare_object PHP object handlers are also
> implemented, though the comparison does not yield meaningful results
> when used with >, <, >= and <=.
>
> Note that BreakIterator is an iterator only in the sense of the
> first 'Iterator' in 'IteratorIterator', i.e., it does not
> implement the Iterator interface. The reason is that there is
> no sensible implementation for Iterator::key(). Using it for
> an ordinal of the current boundary is not feasible because
> we are allowed to move to any boundary at any time. If we were
> to determine the current ordinal when last() is called we'd
> have to traverse the whole input text to find out how many
> breaks there were before. Therefore, BreakIterator implements
> only Traversable. It can be wrapped in an IteratorIterator,
> but the usual warnings apply.
>
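The key() argument can be sketched like this in Python. The boundary
rule below (a break after every space and at the end of the text) is a
made-up stand-in for real ICU rules, and all three function names are
hypothetical. The point is only that jumping to the last boundary is
cheap, while knowing its ordinal means walking every boundary before it.

```python
def next_boundary(text, pos):
    """Toy rule: a boundary after each space and at the end of text."""
    if pos >= len(text):
        return -1
    idx = text.find(" ", pos)
    return len(text) if idx == -1 else idx + 1


def last_boundary(text):
    # Jumping straight to the end needs no traversal at all.
    return len(text)


def ordinal_of(text, target):
    """Count the boundaries before target by walking from the start."""
    count, pos = 0, 0
    while pos != target:
        pos = next_boundary(text, pos)
        if pos == -1:
            raise ValueError("target is not a boundary")
        count += 1
    return count


text = "one two three"
end = last_boundary(text)       # constant time
print(end, ordinal_of(text, end))  # the ordinal costs a full scan
```
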
> Finally, I added a convenience method to BreakIterator:
> getPartsIterator(). This provides an IntlIterator, backed
> by the BreakIterator PHP object (i.e. moving the pointer or
> changing the text in BreakIterator affects the iterator
> and also moving the iterator affects the backing BreakIterator),
> which allows traversing the text between each boundary.
> This iterator uses the original text to retrieve the text
> between two positions, not the code points returned by the
> wrapping UText. Therefore, if the text includes invalid code
> unit sequences, these invalid sequences will be in the output
> of this iterator, not U+FFFD code points.
>
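A rough Python sketch of why the parts keep invalid sequences: if the
pieces are cut out of the original bytes by offset, nothing is ever
replaced. The parts() helper and the hand-written boundary list are
assumptions for illustration, not the actual getPartsIterator()
implementation.

```python
def parts(raw, boundaries):
    """Yield the original bytes between consecutive boundary offsets."""
    for start, end in zip(boundaries, boundaries[1:]):
        yield raw[start:end]


raw = b"ab \xff cd"                # \xff is not valid UTF-8
boundaries = [0, 2, 3, 4, 5, 7]    # hypothetical offsets, written by hand

pieces = list(parts(raw, boundaries))
print(pieces)                      # the \xff byte is returned as-is
```

Decoding raw with replacement would show U+FFFD at offset 3, but the
slices above never go through a decoder, so the invalid byte survives.
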
> The class RuleBasedBreakIterator exposes a constructor that allows
> building an iterator from arbitrary compiled or non-compiled
> rules. The form of these rules is described in the tutorial linked
> above. The rest of the methods allow retrieving the rules
> (getRules() and getCompiledRules()), a hash code of the rule set
> (hashCode()) and the rule statuses (getRuleStatus() and
> getRuleStatusVec()).
>
> Because the RuleBasedBreakIterator constructor may return parse
> errors, I reuse the UParseError-to-text function that was in the
> transliterator files. Therefore, I moved that function to
> intl_error.c.
>
> common_enum.cpp was also changed, mainly to expose previously
> static functions. This avoided code duplication when implementing
> the BreakIterator iterator and the IntlIterator returned by
> BreakIterator::getPartsIterator().
>