Class CompositeBreakIterator
java.lang.Object
org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator
An internal BreakIterator for multilingual text, following recommendations from UAX #29: Unicode Text Segmentation (http://unicode.org/reports/tr29/).
See http://unicode.org/reports/tr29/#Tailoring for the motivation of this design.
Text is first divided at script boundaries into runs of a single script. Processing of each run is then delegated to the break iterator appropriate for that script.
This break iterator also allows you to retrieve the ISO 15924 script code associated with a piece of text.
See also UAX #29, UTR #24
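The two-step design described above (split at script boundaries, then delegate each run) can be sketched with JDK classes only. This is a minimal illustration, not the Lucene implementation: it substitutes `Character.UnicodeScript` for ICU's script data and `java.text.BreakIterator` for the ICU rule-based iterators, and the names `ScriptRunSketch`, `segment`, and `iteratorFor` are hypothetical. Treating COMMON and INHERITED characters as part of the surrounding run is also a simplifying assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ScriptRunSketch {
    /** Returns "SCRIPT:token" entries; each script run is word-broken on its own. */
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int runStart = 0;
        while (runStart < text.length()) {
            // Step 1: find the end of the current script run.
            Character.UnicodeScript runScript = null; // unknown until a non-neutral char is seen
            int runEnd = runStart;
            while (runEnd < text.length()) {
                int cp = text.codePointAt(runEnd);
                Character.UnicodeScript s = Character.UnicodeScript.of(cp);
                boolean neutral = s == Character.UnicodeScript.COMMON
                        || s == Character.UnicodeScript.INHERITED;
                if (!neutral) {
                    if (runScript == null) {
                        runScript = s;
                    } else if (s != runScript) {
                        break; // script boundary reached
                    }
                }
                runEnd += Character.charCount(cp);
            }
            if (runScript == null) {
                runScript = Character.UnicodeScript.COMMON; // run held only neutral characters
            }
            // Step 2: delegate the run to the break iterator chosen for its script.
            String run = text.substring(runStart, runEnd);
            BreakIterator breaker = iteratorFor(runScript);
            breaker.setText(run);
            int start = breaker.first();
            for (int end = breaker.next(); end != BreakIterator.DONE;
                    start = end, end = breaker.next()) {
                String piece = run.substring(start, end);
                // Keep only pieces that start with a letter or digit (drops spaces, punctuation).
                if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                    tokens.add(runScript + ":" + piece);
                }
            }
            runStart = runEnd;
        }
        return tokens;
    }

    /** Hypothetical lookup; this sketch returns the same JDK word iterator for every script. */
    static BreakIterator iteratorFor(Character.UnicodeScript script) {
        return BreakIterator.getWordInstance(Locale.ROOT);
    }
}
```

In this sketch, `ScriptRunSketch.segment("abc\u043c\u0438\u0440")` yields two tokens, one tagged LATIN and one tagged CYRILLIC, because the run is cut where the script changes even though no whitespace separates the two words.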
-
Field Summary
Fields
private final ICUTokenizerConfig config
private BreakIteratorWrapper rbbi
private final ScriptIterator scriptIterator
private char[] text
private final BreakIteratorWrapper[] wordBreakers
-
Constructor Summary
Constructors
CompositeBreakIterator(ICUTokenizerConfig config)
-
Method Summary
Methods
(package private) int current()
Retrieve the current break position.
private BreakIteratorWrapper getBreakIterator(int scriptCode)
(package private) int getRuleStatus()
Retrieve the rule status code (token type) from the underlying break iterator.
(package private) int getScriptCode()
Retrieve the UScript script code for the current token.
(package private) int next()
Retrieve the next break position.
(package private) void setText(char[] text, int start, int length)
Set a new region of text to be examined by this iterator.
-
Field Details
-
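config
private final ICUTokenizerConfig config
-
wordBreakers
private final BreakIteratorWrapper[] wordBreakers
-
rbbi
private BreakIteratorWrapper rbbi
-
scriptIterator
private final ScriptIterator scriptIterator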
-
text
private char[] text
-
-
Constructor Details
-
CompositeBreakIterator
CompositeBreakIterator(ICUTokenizerConfig config)
-
-
Method Details
-
next
int next()
Retrieve the next break position. If the RBBI (rule-based break iterator) range is exhausted within the current script boundary, the next script boundary is examined.
Returns:
the next break position or BreakIterator.DONE
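The DONE contract above is the same one used by the JDK's `java.text.BreakIterator`, so the usual consumption loop can be illustrated with it. This is a sketch under that assumption: `CompositeBreakIterator` is package-private and is not called here, and `BreakLoopSketch` and `breakPositions` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakLoopSketch {
    /** Collects every position returned by next() until the iterator reports DONE. */
    static List<Integer> breakPositions(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> positions = new ArrayList<>();
        it.first(); // position at the start of the text before advancing
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            positions.add(end);
        }
        return positions;
    }
}
```

For the input `"a b"` this loop collects the positions 1, 2, and 3; for an empty string the first call to `next()` already returns `BreakIterator.DONE` and the list stays empty.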
-
current
int current()
Retrieve the current break position.
Returns:
the current break position or BreakIterator.DONE
-
getRuleStatus
int getRuleStatus()
Retrieve the rule status code (token type) from the underlying break iterator.
Returns:
rule status code (see RuleBasedBreakIterator constants)
-
getScriptCode
int getScriptCode()
Retrieve the UScript script code for the current token. This code can be decoded with UScript into a name or ISO 15924 code.
Returns:
UScript script code for the current token.
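With ICU4J on the classpath, the returned int would typically be decoded with `UScript.getName(int)` or `UScript.getShortName(int)`. As a dependency-free illustration of the same idea, the sketch below uses the JDK's `Character.UnicodeScript`, whose `forName` accepts four-letter ISO 15924 aliases. The JDK enum is a different type from ICU's int codes, so this is an analogy rather than a drop-in; `ScriptCodeSketch` and its methods are hypothetical names.

```java
class ScriptCodeSketch {
    /** Resolves an ISO 15924 code such as "Latn" to the JDK script constant. */
    static Character.UnicodeScript fromIso15924(String code) {
        return Character.UnicodeScript.forName(code);
    }

    /** Checks whether a code point belongs to the script named by an ISO 15924 code. */
    static boolean isScript(int codePoint, String iso15924Code) {
        return Character.UnicodeScript.of(codePoint) == fromIso15924(iso15924Code);
    }
}
```

For example, `ScriptCodeSketch.isScript('a', "Latn")` is true, while `ScriptCodeSketch.isScript('a', "Cyrl")` is false.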
-
setText
void setText(char[] text, int start, int length)
Set a new region of text to be examined by this iterator.
Parameters:
text - buffer of text
start - offset into buffer
length - maximum length to examine
-
getBreakIterator
private BreakIteratorWrapper getBreakIterator(int scriptCode)
-