Copyright © 1995-2012 Opera Software ASA. All rights reserved. This file is part of the Opera web browser. It may not be distributed under any circumstances.
This document describes the internals of the text segmentation implementation (Unicode Standard Annex #29 — Text Segmentation).
Grapheme cluster boundary detection does not require a state machine, so there is a simple API that takes two characters and reports whether there is a boundary between them. The check evaluates the GBx rules in order and returns the result of the first rule that matches.
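The first-match-wins evaluation can be sketched as follows. This is a minimal illustration, not the implementation described here: the function names `gcb_class` and `is_grapheme_boundary` are hypothetical, only a few of the GBx rules are covered, and the character classification is a rough approximation built on Python's `unicodedata` rather than the real Grapheme_Cluster_Break property.

```python
import unicodedata


def gcb_class(ch):
    """Approximate a character's grapheme break class (assumption: toy mapping)."""
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    category = unicodedata.category(ch)
    if category in ("Mn", "Me"):          # non-spacing and enclosing marks
        return "Extend"
    if category in ("Cc", "Zl", "Zp"):    # controls, line and paragraph separators
        return "Control"
    return "Other"


def is_grapheme_boundary(first, second):
    """Return True if a grapheme cluster boundary lies between the two characters.

    Rules are tried in order; the first one that matches decides the result.
    """
    a, b = gcb_class(first), gcb_class(second)
    # GB3: do not break between CR and LF.
    if a == "CR" and b == "LF":
        return False
    # GB4: break after controls and line endings.
    if a in ("CR", "LF", "Control"):
        return True
    # GB5: break before controls and line endings.
    if b in ("CR", "LF", "Control"):
        return True
    # GB9: do not break before extending characters.
    if b == "Extend":
        return False
    # Final rule: otherwise, break everywhere.
    return True
```

Because the rules are ordered, a line feed followed by a combining mark is a boundary (GB4 matches before GB9 is reached), while a base letter followed by the same mark is not.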
Word boundary detection is implemented using a state machine, which is described in the diagram below. Because the state machine uses only one character of look-ahead, the two rules that require a two-character look-ahead (WB6 and WB12) are not handled correctly.
Green edges denote that a break exists, red that one does not. If there is no specific transition from a state, the transition from the unlabelled state is used.
State machine for word boundary detection.
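A table-driven machine with a fallback row can be sketched as below. The states, character classes, and table contents are hypothetical and far smaller than the real diagram; the sketch only illustrates how a missing transition falls back to the unlabelled state and how each edge carries a break/no-break flag.

```python
DEFAULT = ""  # hypothetical name for the unlabelled state

# state -> character class -> (next state, break before this character?)
TRANSITIONS = {
    DEFAULT: {
        "ALetter": ("letter", True),
        "Numeric": ("digit", True),
        "Other": (DEFAULT, True),
    },
    # WB5 (letter after letter) and WB9 (digit after letter): no break.
    "letter": {"ALetter": ("letter", False), "Numeric": ("digit", False)},
    # WB8 (digit after digit) and WB10 (letter after digit): no break.
    "digit": {"Numeric": ("digit", False), "ALetter": ("letter", False)},
}


def classify(ch):
    """Toy stand-in for the Word_Break property (assumption)."""
    if ch.isalpha():
        return "ALetter"
    if ch.isdigit():
        return "Numeric"
    return "Other"


def step(state, cls):
    """Take the state's own transition, or the unlabelled state's if none exists."""
    row = TRANSITIONS.get(state, {})
    return row.get(cls, TRANSITIONS[DEFAULT][cls])


def word_breaks(text):
    """Return the indices at which a break is reported before the character."""
    state, breaks = DEFAULT, []
    for index, ch in enumerate(text):
        state, is_break = step(state, classify(ch))
        if is_break:
            breaks.append(index)
    return breaks
```

With this toy table, `word_breaks("ab 12c")` reports breaks before indices 0, 2 and 3. It also reports breaks on both sides of the apostrophe in `"a'b"`: deciding not to break there (WB6) would require seeing the letter after the apostrophe, which a single character of look-ahead cannot provide.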
Sentence boundary detection is implemented using a state machine, which is described in the diagram below.
Green edges denote that a break exists, red that one does not. If there is no specific transition from a state, the transition from the any state is used. The any' state is special: transitions to it are treated as transitions from the any state, except that a break is reported.
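One possible reading of the any' rule is sketched below: an edge that targets any' is resolved by taking the transition the any state would make on the same character, with the break flag forced on. This interpretation, the state names `term` and `term_sp`, the character classes, and the table contents are all assumptions made for illustration; they are not taken from the actual diagram.

```python
ANY, ANY_PRIME = "any", "any'"

# state -> character class -> (next state, break before this character?)
TABLE = {
    ANY: {"STerm": ("term", False), "Sp": (ANY, False), "Other": (ANY, False)},
    # After a sentence terminator; STerm is absent, so it falls back to `any`.
    "term": {"Sp": ("term_sp", False), "Other": (ANY_PRIME, True)},
    # After a terminator followed by spaces.
    "term_sp": {
        "Sp": ("term_sp", False),
        "STerm": (ANY_PRIME, True),
        "Other": (ANY_PRIME, True),
    },
}


def classify(ch):
    """Toy stand-in for the Sentence_Break property (assumption)."""
    if ch in "!?":
        return "STerm"
    if ch == " ":
        return "Sp"
    return "Other"


def step(state, cls):
    """Resolve one transition, including the fallback and the any' special case."""
    row = TABLE.get(state, {})
    target, is_break = row.get(cls, TABLE[ANY][cls])
    if target == ANY_PRIME:
        # Behave as if leaving `any` on this character, but report a break.
        target, _ = TABLE[ANY][cls]
        is_break = True
    return target, is_break


def sentence_breaks(text):
    """Return the indices at which a break is reported before the character."""
    state, breaks = ANY, []
    for index, ch in enumerate(text):
        state, is_break = step(state, classify(ch))
        if is_break:
            breaks.append(index)
    return breaks
```

Under these assumptions, `sentence_breaks("Hi! Yo? ok")` reports breaks before indices 4 and 8. In `"a! ?b"` the `?` arrives in `term_sp`, so the edge to any' reports a break and the machine continues in `term`, exactly where the any state would have gone on a terminator.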