BreakIterator

13 years ago by Gustavo Lopes

I've wrapped ICU's BreakIterator and RuleBasedBreakIterator. I stopped
short of adding a procedural interface. I think there's a larger
expectation of having an OOP interface when working with iterators.
What do you think? If there's no procedural interface, I'll change the
instances of zend_parse_methods to zpp for performance.

Now I'll copy the commit message here in case someone wants to comment on a
specific point inline:

BreakIterator and RuleBasedBreakIterator added
This commit adds wrappers for the classes BreakIterator and
RuleBasedBreakIterator. The C++ ICU classes are described here:
http://icu-project.org/apiref/icu4c/classBreakIterator.html
http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html

Additionally, a tutorial is available at:
http://userguide.icu-project.org/boundaryanalysis

This implementation wraps UTF-8 text in a UText. The text is
iterated without any copying or conversion to UTF-16. There is
also no validation that the input is actually UTF-8; where there
are malformed sequences, the UText will simply yield U+FFFD.

The class BreakIterator cannot be instantiated directly (has a
private constructor). It provides the interface exposed by the ICU
abstract class with the same name. The PHP class is not abstract
because we may use it to wrap native subclasses of BreakIterator
that we don't know how to wrap. This class includes methods to
move the iterator position to the beginning (first()), to the
end (last()), forward (next()), backwards (previous()), to the
boundary preceding a certain position (preceding()) and following
a certain position (following()) and to obtain the current position
(current()). next() can also be used to advance or recede an
arbitrary number of positions.

BreakIterator also exposes other native methods:
getAvailableLocales(), getLocale() and factory methods to build
several predefined types of BreakIterators: createWordInstance()
for word boundaries, createCharacterInstance() for locale
dependent notions of "characters", createSentenceInstance() for
sentences, createLineInstance() and createTitleInstance() -- for
title casing breaks. These factories currently return
RuleBasedBreakIterators where the names of the rule sets are found
in the ICU data, observing the passed locale (although the locale
is taken into consideration, there are very few exceptions to the
root rules).

The clone and compare_object PHP object handlers are also
implemented, though the comparison does not yield meaningful results
when used with >, <, >= and <=.

Note that BreakIterator is an iterator only in the sense of the
first 'Iterator' in 'IteratorIterator', i.e., it does not
implement the Iterator interface. The reason is that there is
no sensible implementation for Iterator::key(). Using it for
an ordinal of the current boundary is not feasible because
we are allowed to move to any boundary at any time. If we were
to determine the current ordinal when last() is called we'd
have to traverse the whole input text to find out how many
breaks there were before. Therefore, BreakIterator implements
only Traversable. It can be wrapped in an IteratorIterator,
but the usual warnings apply.
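
As an illustration only (not code from the patch): a userland class cannot implement Traversable directly, so the closest analogue is IteratorAggregate. The sketch below uses a hypothetical BoundaryList class with made-up boundary positions to show an object that is Traversable, is not an Iterator, and can still be wrapped in an IteratorIterator.

```php
<?php
// Hypothetical stand-in for a boundary source. It is Traversable
// (via IteratorAggregate) but does not implement Iterator itself.
final class BoundaryList implements IteratorAggregate
{
    private $positions;

    public function __construct(array $positions)
    {
        $this->positions = $positions;
    }

    public function getIterator(): Iterator
    {
        return new ArrayIterator($this->positions);
    }
}

// Assumed boundary positions, chosen only for the example.
$list = new BoundaryList([0, 3, 4, 7]);

var_dump($list instanceof Traversable);
var_dump($list instanceof Iterator);

// IteratorIterator accepts any Traversable; the keys here are plain ordinals.
foreach (new IteratorIterator($list) as $key => $position) {
    echo "$key => $position\n";
}
```

The keys seen in the loop are just ordinals supplied by the wrapped ArrayIterator, which mirrors the point that the boundary source itself has no natural key.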

Finally, I added a convenience method to BreakIterator:
getPartsIterator(). This provides an IntlIterator, backed
by the BreakIterator PHP object (i.e. moving the pointer or
changing the text in BreakIterator affects the iterator
and also moving the iterator affects the backing BreakIterator),
which allows traversing the text between each boundary.
This iterator uses the original text to retrieve the text
between two positions, not the code points returned by the
wrapping UText. Therefore, if the text includes invalid code
unit sequences, these invalid sequences will be in the output
of this iterator, not U+FFFD code points.

The class RuleBasedBreakIterator exposes a constructor that allows
building an iterator from arbitrary compiled or non-compiled
rules. The form of these rules is described in the tutorial linked
above. The rest of the methods allow retrieving the rules --
getRules() and getCompiledRules() --, a hash code of the rule set
(hashCode()) and the rules statuses (getRuleStatus() and
getRuleStatusVec()).

Because the RuleBasedBreakIterator constructor may return parse
errors, I reuse the UParseError to text function that was in the
transliterator files. Therefore, I moved that function to
intl_error.c.

common_enum.cpp was also changed, mainly to expose previously
static functions. This avoided code duplication when implementing
the BreakIterator iterator and the IntlIterator returned by
BreakIterator::getPartsIterator().

--
Gustavo Lopes

13 years ago by David Muir

Coming from a "pleb", my only concern is the name if the class is in the
global scope. A "BreakIterator" to me sounds like something related to
breaking out of a looping structure, and not something used for
iterating over various language structure boundaries.
If it's in an ICU namespace, then it's not a problem, as it's clearly
related to Unicode.

Cheers,
David

13 years ago by Gustavo Lopes

Coming from a "pleb", my only concern is the name if the class is in the
global scope. A "BreakIterator" to me sounds like something related to
breaking out of a looping structure, and not something used for
iterating over various language structure boundaries.
If it's in an ICU namespace, then it's not a problem, as it's clearly
related to Unicode.

We currently don't use namespaces in any of the core extensions. All
the other symbols in ext/intl are in the global namespace; to put
BreakIterator in a new namespace would be inconsistent -- and to move the
whole extension into one would be a huge BC break.

As to the name chosen for the class, it just mirrors the name used in
ICU. In some cases, we prefixed the class name with Intl, in order to
minimize the likelihood of symbol collisions or to distinguish it from
other similar functionality in PHP (something namespaces would be more
appropriate for), but otherwise we prefer to keep the symbol names used
in ICU in order to make it easy for people who already know the native
API.

Additionally, I think your concerns are exaggerated. The symbol
BreakIterator can only be used in contexts where it's obvious it's a class
name, as in BreakIterator::createWordInstance('en').

--
Gustavo Lopes

13 years ago by Benjamin Eberlei

How about IntlBreakIterator? I agree with David that the naming is very
weird, it doesn't hint at something from Intl but another crazy spl
iterator :-)


13 years ago by Pierre Joye

hi,

How about IntlBreakIterator? I agree with David that the naming is very
weird, it doesn't hint at something from Intl but another crazy spl
iterator :-)

I agree too. BreakIterator is a very common name and I suspect
possible naming conflicts may happen.

Cheers,

Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

13 years ago by Gustavo Lopes

On Fri, Jun 1, 2012 at 10:02 AM, Benjamin Eberlei
kontakt@beberlei.de wrote:

How about IntlBreakIterator? I agree with David that the naming is very
weird, it doesn't hint at something from Intl but another crazy spl
iterator :-)

Aside from date-related classes -- which could be confused with stuff
from ext/date or even ext/calendar --, no other classes have Intl in
their name. Does SpoofChecker hint at something from intl?
ResourceBundle? ICU is a rather large library, and while
internationalization is a common theme, the APIs have diverse
functionality and therefore diverse names. Plus, SPL does not have a
monopoly on the *Iterator names.

I agree too. BreakIterator is a very common name and I suspect
possible naming conflicts may happen.

So would you have RuleBasedBreakIterator renamed
IntlRuleBasedBreakIterator too?... I find it very hard to believe that
"BreakIterator" is "a very common name", but I'm open to evidence that
points otherwise. This argument could maybe be made for
'Transliterator', which was added in 5.4.

--
Gustavo Lopes

13 years ago by Maciek Sokolewicz

In my personal opinion, all Intl classes should be prefixed with Intl.
It's not so much that BreakIterator is a very common name, but rather a
very ambiguous name that may point to many different things. Just by the
fact that multiple people have already posted here that at first they
thought BreakIterator had something to do with the break statement gives
you a rather solid hint that the function of this class is not
immediately clear. Prefixing it with Intl immediately makes it clear
that it belongs to the Intl superfamily, and limits the potential
misunderstandings a lot. I actually still don't understand why not all
Intl classes are prefixed. Isn't that the usual procedure, e.g. for
MySQLi and pretty much all other extensions?

13 years ago by Gustavo Lopes

In my personal opinion, all Intl classes should be prefixed with
Intl. It's not so much that BreakIterator is a very common name, but
rather a very ambiguous name that may point to many different things.
Just by the fact that multiple people have already posted here that at
first they thought BreakIterator had something to do with the break
statement gives you a rather solid hint that the function of this
class is not immediately clear. Prefixing it with Intl immediately
makes it clear that it belongs to the Intl superfamily, and limits the
potential misunderstandings a lot. I actually still don't understand
why not all Intl classes are prefixed. Isn't that the usual procedure,
e.g. for MySQLi and pretty much all other extensions?

We've had the convention of prefixing function names with some
extension prefix, but this convention has not been as marked for class
names -- perhaps because there were not so many of them, and so there
were fewer collision/confusion problems.

In any case, I'll rename the classes before merging.

--
Gustavo Lopes

13 years ago by Pierre Joye

HI,

In any case, I'll rename the classes before merging.

You may have missed part of my replies. One key part was: to discuss
it before doing anything.

This has been only one day of discussion, and I don't feel we have
reached a long-term decision about what to do in this area. Rather than
dealing with this one class only, I would prefer to solve this problem
once and for all (other intl classes/cases).

Cheers,

Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

13 years ago by Pierre Joye

hi,

So would you have RuleBasedBreakIterator renamed IntlRuleBasedBreakIterator
too?...

Ideally we would, yes, although they are less common and less likely to
be taken as part of another API.

I find it very hard to believe that "BreakIterator" is "a very
common name", but I'm open to evidence that points otherwise. This argument
could maybe be made for 'Transliterator', which was added in 5.4.

Transliterator is not as confusing as "BreakIterator", sorry.

I would not care much if it were a longer, less confusing or less common
name. But with this one, the risk of conflict with existing code may be
too high not to be discussed.

Cheers,

Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

13 years ago by Gustavo Lopes

On Fri, Jun 1, 2012 at 1:34 PM, Gustavo Lopes
glopes@nebm.ist.utl.pt wrote:

So would you have RuleBasedBreakIterator renamed
IntlRuleBasedBreakIterator too?...

Ideally we would, yes, although they are less common and less likely to
be taken as part of another API.

I find it very hard to believe that "BreakIterator" is "a very
common name", but I'm open to evidence that points otherwise. This
argument could maybe be made for 'Transliterator', which was added in 5.4.

Transliterator is not as confusing as "BreakIterator", sorry.

You removed the quoting that provided context, but I was responding to
your claim that it was a "very common name" and that you "suspected
naming conflicts might happen".

But in fact "Transliterator" is much more confusing than
"BreakIterator". The name "Transliterator" is an ICU artifact of the
past; that module is now called "Text Transformation", as it
provides a generic text transformation API, not one specifically for
transliteration.

--
Gustavo Lopes

13 years ago by Nikita Popov

We currently don't use namespaces in any of the core extensions.

Does anything prevent us from starting to do so?

other symbols in ext/intl are in the global namespace; to put BreakIterator
in a new namespace would be inconsistent -- and to put the whole extension
would be a huge BC break.

It sure would be a bit inconsistent, but if you see it as "All new
Intl classes will go into the Intl namespace" it makes perfect sense in my
eyes. Also, at least in theory, one could alias all intl classes to
namespaced variants (though I'm not sure that's really necessary.)

Nikita

13 years ago by Gustavo Lopes

On Fri, Jun 1, 2012 at 9:57 AM, Gustavo Lopes
glopes@nebm.ist.utl.pt wrote:

We currently don't use namespaces in any of the core extensions.

Does anything prevent us from starting to do so?

other symbols in ext/intl are in the global namespace; to put
BreakIterator in a new namespace would be inconsistent -- and to put the
whole extension would be a huge BC break.

It sure would be a bit inconsistent, but if you see it as "All new
Intl classes will go into the Intl namespace" it makes perfect sense
in my eyes.

You say that it makes perfect sense, but you don't explain why.

Also, at least in theory, one could alias all intl classes to
namespaced variants (though I'm not sure that's really necessary.)

Yes, that would be the only sane way to do it, but I really don't see a
benefit large enough to compensate for having a different treatment for
classes depending on some arbitrary line like when they were added. The
only real benefit of namespaces is to avoid name collisions, but most
new projects use namespaces and we can easily avoid name collisions in
the PHP core.

Plus, remember ext/intl is maintained in PECL too, where it supports
PHP 5.2.

Anyway, this is getting a bit off-topic.

--
Gustavo Lopes

13 years ago by Stas Malyshev

Hi!

I've wrapped ICU's BreakIterator and RuleBasedBreakIterator. I stopped
short of adding a procedural interface. I think there's a larger
expectation of a having an OOP interface when working with iterators.
What do you think? If there's no procedural interface, I'll change the
instances of zend_parse_methods to zpp for performance.

Nice! I remember we had TextIterator in PHP 6, IIRC that was the reason
BreakIterator never found its way into intl.

BreakIterator also exposes other native methods:
getAvailableLocales(), getLocale() and factory methods to build
several predefined types of BreakIterators: createWordInstance()
for word boundaries, createCharacterInstance() for locale
dependent notions of "characters", createSentenceInstance() for
sentences, createLineInstance() and createTitleInstance() -- for
title casing breaks. These factories currently return

One thing I notice here is that with this API it is not possible to
programmatically choose the iteration unit - you'd have to do a
switch for that. Do you think it may be a good idea to have a generic
function that allows choosing the unit programmatically?

What is the notion of characters - is it grapheme characters? Is there an
option to iterate over code points too - not sure if it's useful, just
curious, as we used to have it in PHP 6 IIRC.

About getAvailableLocales() - what does this actually do? Does it list all
available locales in the system, the ones that have BreakIterator rules, or
something else? If it's not related to BI, I'm not sure we need to have
it in BI. What is the intended usage of it? Maybe it should be part of
the Locale class?

Note that BreakIterator is an iterator only in the sense of the
first 'Iterator' in 'IteratorIterator', i.e., it does not
implement the Iterator interface. The reason is that there is
no sensible implementation for Iterator::key(). Using it for

Doesn't it have a notion of current position? If so, key should be the
current position.

Will this BreakIterator be usable in foreach? I'm not sure I understand
it from this description - understanding this without any usage
examples, RFCs or code snippets for intended usage is really hard and I
think we should really start with doing that. I would expect this class
to work like this:

foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
    echo "Word number $i is $word\n";
}

or at least like this:

foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
    echo "Next word at position $i is: $word\n";
}

Is it the model? If not, I think we need to wrap the C API to make this
possible, because this is what people expect in PHP from the iterator.

Finally, I added a convenience method to BreakIterator:
getPartsIterator(). This provides an IntlIterator, backed
by the BreakIterator PHP object (i.e. moving the pointer or
changing the text in BreakIterator affects the iterator
and also moving the iterator affects the backing BreakIterator),
which allows traversing the text between each boundary.

How that text is being traversed - by code
points/characters/graphemes/bytes?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by Gustavo Lopes

BreakIterator also exposes other native methods:
getAvailableLocales(), getLocale() and factory methods to build
several predefined types of BreakIterators: createWordInstance()
for word boundaries, createCharacterInstance() for locale
dependent notions of "characters", createSentenceInstance() for
sentences, createLineInstance() and createTitleInstance() -- for
title casing breaks. These factories currently return

One thing I notice here is that with this API it is not possible to
programmatically choose the iteration unit - you'd have to do a
switch for that. Do you think it may be a good idea to have a generic
function that allows choosing the unit programmatically?

You can create a RuleBasedBreakIterator with any rules you choose. The
rules are basically a set of regular expressions; ICU has two matching
modes -- by default it tries the longest match, but it can also chain
together rules. There are rules to advance, to go back and to go to a
safe position from an arbitrary position in the two directions. The ICU
user guide to which I linked in the first e-mail has more details.

What is the notion of characters - is it grapheme characters? Is there an
option to iterate over code points too - not sure if it's useful, just
curious, as we used to have it in PHP 6 IIRC.

Yes, they are grapheme clusters. ICU has a special rule for Thai, but
from what I see in the tracker, it's obsolete with recent versions of Unicode
(possibly the root rule is now generic enough).

To iterate over code points, you can build a very simple
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
example here: https://gist.github.com/2843005

About getAvailableLocales() - what does this actually do? Does it list all
available locales in the system, the ones that have BreakIterator rules, or
something else? If it's not related to BI, I'm not sure we need to have
it in BI. What is the intended usage of it? Maybe it should be part of
the Locale class?

Right now, the ICU implementation just calls
Locale::getAvailableLocales(), but its description is "Gets all the
available locales that has localized text boundary data." so I suppose
it could return a different set in the future.

Note that BreakIterator is an iterator only in the sense of the
first 'Iterator' in 'IteratorIterator', i.e., it does not
implement the Iterator interface. The reason is that there is
no sensible implementation for Iterator::key(). Using it for

Doesn't it have a notion of current position? If so, key should be the
current position.

Will this BreakIterator be usable in foreach? I'm not sure I understand
it from this description - understanding this without any usage
examples, RFCs or code snippets for intended usage is really hard and I
think we should really start with doing that. I would expect this class
to work like this:

foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
    echo "Word number $i is $word\n";
}

or at least like this:

foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
    echo "Next word at position $i is: $word\n";
}

Is it the model? If not, I think we need to wrap the C API to make this
possible, because this is what people expect in PHP from the iterator.

The option I chose here: the BreakIterator mirrors the ICU class of the
same name -- it iterates over breaks, i.e., boundaries in the text. Hence,
the iterator returns the positions of the successive boundaries. Therefore,
the position cannot also be used for the key.

Acknowledging that getting the text between the boundaries was going to
be a common scenario, I added a method, getPartsIterator(), that yields
the text between each boundary. Hence, there is one less element in this
iterator than in the BreakIterator.

Neither of the iterators implements getKey(), so when traversing, the keys
will be 0, 1, 2... It would probably be a good idea to change the
parts iterator to give the left boundary as the key. That way one could
do:

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
foreach ($bi->getPartsIterator() as $k => $v) {
    echo "$v is at position $k\n";
}

instead of

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
$pos = $bi->first();
foreach ($bi->getPartsIterator() as $v) {
    echo "$v is at position $pos\n";
    $pos = $bi->current();
}
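
To make the relation between boundaries and parts concrete, here is a rough sketch that runs without ext/intl. The helper partsWithOffsets() and the boundary values are hypothetical; it only shows how n boundaries give n - 1 parts and how the left boundary could serve as the key.

```php
<?php
// Hypothetical helper, not part of ext/intl: yields the text between each
// pair of consecutive boundaries, keyed by the left boundary (byte offset).
function partsWithOffsets(string $text, array $boundaries): Generator
{
    $count = count($boundaries);
    for ($i = 0; $i + 1 < $count; $i++) {
        $left = $boundaries[$i];
        $right = $boundaries[$i + 1];
        yield $left => substr($text, $left, $right - $left);
    }
}

// Assumed boundaries for "foo bar"; a real word iterator would compute them.
foreach (partsWithOffsets('foo bar', [0, 3, 4, 7]) as $k => $v) {
    echo "'$v' is at position $k\n";
}
```

Four boundaries produce three parts here, including the whitespace between the two words.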

Another possibility would be to have the break iterator itself behave
as the parts iterator for iteration purposes. I don't think that is a
good idea. Even though BreakIterator does not implement Iterator, people
would expect next() and current() to return the next and current iterator
value, while they would be returning the iteration key.

By the way, you can look at the test cases in the tree on github for
examples:
https://github.com/cataphract/php-src/commit/d289c3977ed4ba8d9ba127e5af9f709b19b8e1ba

Thanks for the comments!

--
Gustavo Lopes

13 years ago by Stas Malyshev

Hi!

You can create a RuleBasedBreakIterator with any rules you choose. The

I understand that, but I have no idea how to write proper rules for word
boundaries, I just want to tell it "give me word boundaries" but not by
saying createWordBoundaries() but by doing createIterator($type) where
$type == WORD_BOUNDARIES.

To iterate over code points, you can build a very simple
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
example here: https://gist.github.com/2843005

Is there any reason not to provide this as a service for PHP users? I
understand somebody who is a specialist in ICU knows that already, but
most PHP users don't know this magic.

Right now, the ICU implementation just calls
Locale::getAvailableLocales(), but its description is "Gets all the
available locales that has localized text boundary data." so I suppose
it could return a different set in the future.

My only concern is that no other classes have getAvailableLocales() and
it doesn't seem to do anything useful now, so maybe we should omit it
for now?

Acknowledging that getting the text between the boundaries was going to
be a common scenario, I added a method, getPartsIterator(), that yields
the text between each boundary. Hence, there is one less element in this
iterator than in the BreakIterator.

Neither of the iterators implements getKey(), so when traversing, the keys
will be 0, 1, 2... It would probably be a good idea to change the
parts iterator to give the left boundary as the key. That way one could
do:

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
foreach ($bi->getPartsIterator() as $k => $v) {
    echo "$v is at position $k\n";
}

Another thing I notice here: why not make:
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);

into:
$bi = BreakIterator::createWordInstance(NULL, $foo);

This makes for less boilerplate code, since if you are creating an
iterator, chances are you already have some string to iterate over.

Another possibility would be to have the break iterator itself behave
as the parts iterator for iteration purposes. I don't think that is a
good idea. Even though BreakIterator does not implement Iterator, people
would expect next() and current() return the next and current iterator
value, while they would be returning the iteration key.

OK, if you have to do getPartsIterator() it's fine as long as you can
easily do foreach on it, since that's what one expects from iterator.
I'd also add some flag that would skip or not skip whitespace, if this
is possible - so for "foo bar" sometimes you want ['foo', ' ', 'bar']
and sometimes you want just ['foo', 'bar'] - does ICU support it somehow?

Again, having a full description of the proposed API would be nice.
For example, what does hashCode() do?

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by Gustavo Lopes

On Mon, 04 Jun 2012 21:09:28 +0200, Stas Malyshev smalyshev@sugarcrm.com
wrote:

I understand that, but I have no idea how to write proper rules for word
boundaries, I just want to tell it "give me word boundaries" but not by
saying createWordBoundaries() but by doing createIterator($type) where
$type == WORD_BOUNDARIES.

Why? This makes no sense to me. Why would createIterator(WORD_BOUNDARIES)
be better than BreakIterator::createWordInstance()? Especially in a
dynamic language like PHP where you can do:

$type = 'word';
$bi = BreakIterator::{"create" . $type . 'instance'}(NULL);
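
For the config-driven case, one option is a small userland lookup table in front of the existing factory methods. The sketch below is hypothetical: create_break_iterator() is an illustrative name, and the code assumes the class name IntlBreakIterator, under which the wrapper later shipped in PHP 5.5.

```php
<?php
// Hypothetical helper, not part of the proposed API. Assumes the class name
// IntlBreakIterator (ext/intl, PHP 5.5+).
function create_break_iterator($type, $locale = NULL)
{
    // Whitelist of config values mapped to the ICU-style factory methods.
    $factories = array(
        'character' => 'createCharacterInstance',
        'word'      => 'createWordInstance',
        'line'      => 'createLineInstance',
        'sentence'  => 'createSentenceInstance',
    );
    if (!isset($factories[$type])) {
        throw new InvalidArgumentException("Unknown boundary type: $type");
    }
    return call_user_func(array('IntlBreakIterator', $factories[$type]), $locale);
}

$bi = create_break_iterator('sentence', 'en_US');
$bi->setText('One. Two.');
```

Compared with building the method name by string concatenation, the whitelist rejects unexpected config values before any method is called.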

To iterate over code points, you can build a very simple
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
example here: https://gist.github.com/2843005

Is there any reason not to provide this as a service for PHP users? I
understand that an ICU specialist knows this already, but most PHP users
don't know this magic.

Well, the reason I didn't add it is that ICU didn't add such an
iterator. I imagine the reason for that is that there are much more
efficient ways to iterate over UTF-8 that don't involve a full-blown
regex-based text segmentation engine. In fact, ICU provides very
efficient ways (with macros and simple specialized functions) to iterate
over UTF-8 text in utf8.h:

http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/common/unicode/utf8.h
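
To illustrate why no segmentation engine is needed for this, here is a rough userland sketch of lead-byte-driven iteration, in the spirit of the forward-iteration macros in utf8.h. The function name utf8_code_points() is made up for the example. It assumes valid UTF-8 input and does not check continuation bytes.

```php
<?php
// Sketch only: the length of each sequence is read off its lead byte.
// Assumes $s is valid UTF-8; a stray or invalid lead byte is returned as a
// single byte instead of being replaced by U+FFFD.
function utf8_code_points($s)
{
    $len = strlen($s);
    $out = array();
    for ($i = 0; $i < $len; $i += $n) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $n = 1;                         // ASCII
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $n = 2;                         // two-byte sequence
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $n = 3;                         // three-byte sequence
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $n = 4;                         // four-byte sequence
        } else {
            $n = 1;                         // continuation or invalid byte
        }
        $out[$i] = substr($s, $i, $n);      // key is the byte offset
    }
    return $out;
}

$points = utf8_code_points("a\xC3\xA9\xE2\x82\xAC");
foreach ($points as $offset => $cp) {
    echo "$offset: $cp\n";
}
```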

Right now, the ICU implementation just calls
Locale::getAvailableLocales(), but its description is "Gets all the
available locales that has localized text boundary data." so I suppose
it could return a different set in the future.

My only concern is that no other classes have getAvailableLocales() and
it doesn't seem to do anything useful now, so maybe we should omit it
for now?

I have no special love for it, but your statement is inaccurate in one
aspect -- I've added a similar function in IntlCalendar... whose
implementation is basically the same:

http://lxr.php.net/xref/THIRD_PARTY/ICU4C/source/i18n/calendar.cpp#getAvailableLocales

I don't mind removing both though.

Another thing I notice here: why not make:
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);

into:
$bi = BreakIterator::createWordInstance(NULL, $foo);

Two reasons:

- it encourages bad behavior, namely not reusing the BreakIterator
  objects.
- that's not the ICU signature. If ICU in the future adds overloads with
  a string in the second argument, we'll find ourselves with odd
  signatures.
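
A short sketch of the reuse pattern referred to above: one iterator is created and then rebound to each text with setText(). It assumes the class name IntlBreakIterator (ext/intl, PHP 5.5+), and it assumes that creating the iterator is the comparatively expensive step.

```php
<?php
// Assumes the class name IntlBreakIterator (ext/intl, PHP 5.5+).
$texts = array('One. Two. Three.', 'Just one.');

// Build the iterator once (assumed to be the comparatively expensive step).
$bi = IntlBreakIterator::createSentenceInstance('en_US');

$counts = array();
foreach ($texts as $text) {
    $bi->setText($text);   // rebind the same iterator to the next text
    // Each part is the text between two consecutive sentence boundaries.
    $counts[] = iterator_count($bi->getPartsIterator());
}

print_r($counts);
```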

OK, if you have to do getPartsIterator() it's fine as long as you can
easily do foreach on it, since that's what one expects from iterator.
I'd also add some flag that would skip or not skip whitespace, if this
is possible - so for "foo bar" sometimes you want ['foo', ' ', 'bar']
and sometimes you want just ['foo', 'bar'] - does ICU support it somehow?

The BreakIterator cannot throw away text. You have to look at the rule
statuses. Example:

$text = 'This is a phrase... with some punctuation.';
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($text);
foreach ($bi->getPartsIterator() as $v) {
    // WORD_NONE_LIMIT equals the lowest word status (numbers), hence >=
    if ($bi->getRuleStatus() >= BreakIterator::WORD_NONE_LIMIT)
        var_dump($v);
}

string(4) "This"
string(2) "is"
string(1) "a"
string(6) "phrase"
string(4) "with"
string(4) "some"
string(11) "punctuation"

Again, having a full description of the proposed API would be nice.
For example, what does hashCode() do?

The ICU docs only say "Compute a hash code for this BreakIterator." If I'm
not mistaken from my quick glance at the source, it just returns the
length of the forward rules.

--
Gustavo Lopes

13 years ago by Stas Malyshev

Hi!

I understand that, but I have no idea how to write proper rules for word
boundaries, I just want to tell it "give me word boundaries" but not by
saying createWordBoundaries() but by doing createIterator($type) where
$type == WORD_BOUNDARIES.

Why? This makes no sense to me. Why would createIterator(WORD_BOUNDARIES)
be better than BreakIterator::createWordInstance()? Especially in a

Libraries, for example. Say I want to make a widget that lets the user
display wrapped text, and gives them the option to wrap on words, on
sentences, or just on any character. I need an underlying library that
wraps properly based on some value in the config.

However, looking at the ICU API, I see that the functions for
programmatic creation of instances are all private, so there's no API
access to that as far as I can see. So I guess this one won't work out.

I have no special love for it, but your statement is inaccurate in one
aspect -- I've added a similar function in IntlCalendar... whose
implementation is basically the same:

Same goes for that one, then. We need to make the API consistent - if
those are useful, let's have them everywhere, if not - let's leave them
for Locale.

Another thing I notice here: why not make:
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);

into:
$bi = BreakIterator::createWordInstance(NULL, $foo);

Two reasons:

- it encourages bad behavior, namely not reusing the BreakIterator
  objects.
- that's not the ICU signature. If ICU in the future adds overloads with
  a string in the second argument, we'll find ourselves with odd
  signatures.

In 99% of cases the BreakIterator object will not be reused anyway,
since the code will be dealing with one text, doing its thing over it and
then forgetting about it. Of course, you can have bigger frameworks and
optimizations - the text parameter is optional, there just to capture
the most common case and avoid boilerplate code.

I think we need to think bigger than copying ICU signatures one to one.
PHP is not C and not Java, so why should PHP users follow to the letter
what C or Java API users do? PHP is no longer a tiny wrapper over C; most
PHP users have never touched C and don't want to wade through the ICU C
docs to figure out how stuff works. We need to make it a one-stop shop.

The BreakIterator cannot throw away text. You have to look at the rule
statuses. Example:

$text = 'This is a phrase... with some punctuation.';
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($text);
foreach ($bi->getPartsIterator() as $v) {
    // WORD_NONE_LIMIT equals the lowest word status (numbers), hence >=
    if ($bi->getRuleStatus() >= BreakIterator::WORD_NONE_LIMIT)
        var_dump($v);
}

Could we have some internal status in the PartsIterator object that
would abstract out such things and provide an API for the common cases?
Again, a description of these APIs is sorely missing right now - I, for
example, have no idea what they can actually do - e.g. what would
getRuleStatus() do?
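
One way such a common case could be wrapped in userland is sketched below. extract_words() is a hypothetical helper, not something in the patch, and the code assumes the class name IntlBreakIterator (ext/intl, PHP 5.5+).

```php
<?php
// Hypothetical userland helper; nothing like it exists in the patch.
// Assumes the class name IntlBreakIterator (ext/intl, PHP 5.5+).
function extract_words($text, $locale = NULL)
{
    $bi = IntlBreakIterator::createWordInstance($locale);
    $bi->setText($text);
    $words = array();
    foreach ($bi->getPartsIterator() as $part) {
        // getRuleStatus() reports the status of the rule that produced the
        // boundary just reached. Values from WORD_NONE_LIMIT upward mark
        // numbers, letters, kana or ideographs; lower values mark spaces
        // and punctuation.
        if ($bi->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $words[] = $part;
        }
    }
    return $words;
}

$words = extract_words('This is a phrase... with some punctuation.');
print_r($words);
```

The caller never sees the rule status, at the price of creating a new iterator on every call.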

The ICU docs only say "Compute a hash code for this BreakIterator." If I'm
not mistaken from my quick glance at the source, it just returns the
length of the forward rules.

Why do we need this function? What use would it be to a PHP user?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

13 years ago by Gustavo Lopes

I've made several changes to accommodate several issues raised on this
list. See

https://github.com/cataphract/php-src/compare/break_iterator

All but the first two commits are new. If you have doubts about usage,
please see the test cases.

--
Gustavo Lopes
