Tailoring to Unicode algorithms

Copyright © 1995-2012 Opera Software ASA. All rights reserved. This file is part of the Opera web browser. It may not be distributed under any circumstances.

Opera-specific tailoring

Some of the Unicode algorithms have been tailored to suit Opera and the web. This document lists the current tailoring.

Character properties

No tailoring has been performed.

Unicode Transformation Formats

No tailoring has been performed.

Unicode line-breaking algorithm (UAX#14)

LB13
CORE-100: Allowing a line break before an infix numeric separator when it follows an alphabetic character or infix numeric separator and one or more spaces
Inserts new rule (AL | IS) SP+ ÷ IS before (AL | IS) × IS.
LB24
CORE-32455: Preventing line breaks inside prefixes
Inserts new rule PR % PR before PR ^ PR.
LB30
CORE-5163: Allowing line break before opening parenthesis.
Inserts new rule AL ÷ OP before (AL | NU) × OP.
CJ: Conditional Japanese Starter: "This character class contains Japanese small hiragana and katakana. Characters of this class may be treated as either NS or ID."
The CJ character class is treated as the NS character class.
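
The rule ordering described for LB13 and LB30, and the CJ resolution, can be illustrated with a minimal sketch. It is not the module's implementation: the function names, the list-of-class-names input and the return values are assumptions made for illustration, and only the rules quoted above are modelled. Each tailored rule is tested before the default rule it precedes.

```python
from typing import List, Optional

BREAK = "÷"      # break opportunity
NO_BREAK = "×"   # break prohibited


def resolve_class(cls: str) -> str:
    """Resolve CJ to NS, as the tailoring above describes."""
    return "NS" if cls == "CJ" else cls


def rule_before_is(classes: List[str], i: int) -> Optional[str]:
    """Decide the boundary before classes[i] when it is IS (LB13 area).

    Returns None when neither rule applies (hypothetical convention).
    """
    if i <= 0 or classes[i] != "IS":
        return None
    # Tailored rule, tested first: (AL | IS) SP+ ÷ IS
    j = i - 1
    while j >= 0 and classes[j] == "SP":
        j -= 1
    if j < i - 1 and j >= 0 and classes[j] in ("AL", "IS"):
        return BREAK
    # Rule it is inserted before: (AL | IS) × IS
    if classes[i - 1] in ("AL", "IS"):
        return NO_BREAK
    return None


def rule_before_op(classes: List[str], i: int) -> Optional[str]:
    """Decide the boundary before classes[i] when it is OP (LB30 area)."""
    if i <= 0 or classes[i] != "OP":
        return None
    # Tailored rule, tested first: AL ÷ OP
    if classes[i - 1] == "AL":
        return BREAK
    # Rule it is inserted before: (AL | NU) × OP, now only reachable for NU
    if classes[i - 1] in ("AL", "NU"):
        return NO_BREAK
    return None
```

Because the tailored rule is evaluated first, AL followed by OP always yields a break opportunity in this sketch, while NU followed by OP still falls through to the untailored rule.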

Unicode normalization (UAX#15)

No tailoring has been performed.

Unicode text boundary algorithm (UAX#29)

General

The Extend class currently does not take the Grapheme_Extend and Other_Grapheme_Extend properties into account. Lifting this tailoring (in effect an unimplemented feature) is under consideration.

Specific to grapheme-cluster detection

GB1
An initial NUL (start-of-text marker) will not report a text boundary if the second character is in the Extend class.
This prevents users of the algorithm from passing a combining mark as a base character to other parts of the lookup algorithm.
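
As a rough illustration of the GB1 tailoring, the sketch below reports whether a boundary exists between the start-of-text marker and the character that follows it. The function names are hypothetical, and the Extend test is an assumption: Python's standard library does not expose the Extend class, so nonspacing and enclosing marks (general categories Mn and Me) stand in for it here.

```python
import unicodedata


def is_extend(ch: str) -> bool:
    """Approximate the Extend class (assumption: Mn and Me marks only)."""
    return unicodedata.category(ch) in ("Mn", "Me")


def boundary_after_start_of_text(second: str) -> bool:
    """GB1 as tailored above: no boundary if the next character is Extend.

    Untailored GB1 would always report a boundary at the start of text.
    """
    return not is_extend(second)
```

With this behaviour a caller that starts a new cluster at every reported boundary never begins a cluster on a lone combining mark such as U+0301.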

Specific to word detection

WB11

U+003B semicolon (;), U+FE14 presentation form for vertical semicolon (︔) and U+FF1B fullwidth semicolon (；) are changed from the MidNum class to Other.
This is to allow semicolon-separated records of words ending with digits or digit groups to be parsed properly.
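
The effect of the semicolon tailoring can be seen in a heavily simplified word segmenter. This is a sketch under stated assumptions, not the module's code: the function names and the `tailored` flag are hypothetical, character classes are approximated with Python's `str` predicates, and only the letter/digit rules (WB5, WB8 to WB10) and WB11/WB12 are modelled.

```python
from typing import List

SEMICOLONS = ";\uFE14\uFF1B"


def word_class(ch: str, tailored: bool) -> str:
    """Return a simplified word-break class for one character."""
    if ch in SEMICOLONS:
        return "Other" if tailored else "MidNum"
    if ch == ",":
        return "MidNum"
    if ch.isdecimal():
        return "Numeric"
    if ch.isalpha():
        return "ALetter"
    return "Other"


def joins(cls: List[str], i: int) -> bool:
    """True when there is no word boundary between positions i-1 and i."""
    prev, cur = cls[i - 1], cls[i]
    alnum = ("ALetter", "Numeric")
    if prev in alnum and cur in alnum:        # WB5, WB8, WB9, WB10
        return True
    if (prev == "Numeric" and cur == "MidNum"
            and i + 1 < len(cls) and cls[i + 1] == "Numeric"):   # WB12
        return True
    if (prev == "MidNum" and cur == "Numeric"
            and i >= 2 and cls[i - 2] == "Numeric"):             # WB11
        return True
    return False


def split_words(text: str, tailored: bool = True) -> List[str]:
    """Split text at every boundary that the simplified rules report."""
    cls = [word_class(c, tailored) for c in text]
    words, start = [], 0
    for i in range(1, len(text)):
        if not joins(cls, i):
            words.append(text[start:i])
            start = i
    if text:
        words.append(text[start:])
    return words
```

For a record such as `ab12;34cd`, the untailored classes keep the whole string together because the semicolon sits between two digits, whereas the tailored classes split it into `ab12`, `;` and `34cd`.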

U+002E full stop (.) is changed from the MidNumLet class to MidNumLet.
This is done for backwards compatibility with the word segmentation performed by previous versions of this module. As a consequence, abbreviations of the form "U.S.A." are not properly segmented.

Specific to sentence detection

No tailoring has been performed.