Bidi algorithm

Copyright © 1995-2012 Opera Software ASA. All rights reserved. This file is part of the Opera web browser. It may not be distributed under any circumstances.

High level specification

Read this chapter if you are interested in the rationale behind the bidi algorithm, and have asked yourself - why is it so complicated?

Introduction to the bidi algorithm - why and how?

Bidi concerns mixing right-to-left sentences with western left-to-right sentences. If you study the algorithm it can seem very theoretical, so this chapter will try to explain it on a higher level.

The inherent direction of a word is based on which character is used. This information is stored in the encodings.bin file. For example 'a' has an inherent direction of left-to-right, while Hebrew 'aleph', א, has an inherent direction of right-to-left. This inherent direction is what is most commonly used in the bidi algorithm.

When describing the bidi algorithm i use small caps 'a' for inherent left to right characters, and large caps 'A' for inherent right to left characters. This is a common convention.

Line versus paragraph

First consider this sentence (in right to left characters). "I HAVE A CAR". The correct output of this sentence for a person reading from right to left would be "RAC A EVAH I".

But if the line was short enough to only fit two words per line, this would be the correct output from a right to left (but still top to bottom!) reading person

HAVE I
RAC A

This of course means that we can not first turn everything around and then do the line breaking (see example below) - the sorting has to be done after line breaking.

Erroneous output!!! Yoda would have liked this - spells out "a car I have"
RAC A
EVAH I

But sorting is based on context as we will see later, so this comes to the conclusion: "level calculation is done on a per-paragraph level, but reordering is done on a per-line basis".

Paragraph level

Consider these two sentences with the same structure of inherent direction

GREEN APPLE and FANTASTIC CAPPUCCINO shisha tobacco

I HAVE A mercedes convertible CAR AND A ford station wagon.

One sentence is written in rtl context and the other in ltr context, the visual order of these two sentences would be respectively

ELPPA NEERG and ONICCUPPAC CITSATNAF shisha tobacco.

ford station wagon A DNA RAC mercedes convertible A EVAH I.

As you can see these two sentences are ordered differently. The first has a ltr (western) context which yields a paragraph level of 0, while the other has an Arabic context which yields a paragraph level of 1.

Level calculation

So how is this all done? How is the difference between the first sentence and the second sentence represented?

Before any sorting is done, each stretch of the same type characters (including neutral characters like space and ! etc) is given a level. The level basically tells how many times a stretch should be turned around. An even level means this is a left-to-right stretch, while an odd level means it is a right-to-left stretch.

A base direction is calculated for each paragraph (either 0 or 1) see above. The base direction may also be changed by increasing the embedding level, which will be explained below.

If a character of the same type as the base direction is encountered, the character gets the same level as the base direction. If a character of the other direction is encountered, it is given a level that is one step higher than the base direction (ex base = 1, 'A' is given level 2)

Embedding

By using embedding marks, the unicode signs "RLE"/"LRE" or the css rule "unicode-bidi:embed" or the "dir" html-attribute, the author can mark the text as being a sub-sentence (i.e. a quote). An embedding is terminated by the unicode sign "PDF" or by leaving the tag that contained the unicode-bidi declaration.

Embedding levels is an integral part of the bidi algorithm even though the examples very often look very silly.

example sentence           did you say      "A WHITE mercedes CAR"      in arabic?
levels without embedding   000000000000     0111111100000000001110     00000000000
resolved without embedding did you say (RLE)"ETIHW A mercedes RAC"(PDF) in arabic?
levels with embedding      000000000000     1111111122222222221111     00000000000
resolved with embedding    did you say      "RAC mercedes WHITE A"      in arabic?

There can be a maximum of 63 embedding levels, and I challenge any reader to construct a sentence that contains that many levels (or even three levels of embedding).

Overrides

Overrides (unicode character "RLO/LRO", css rule "unicode-bidi:override;" or html tag <BDO> ) can be used to explicitly control the direction of a text. The direction should be taken care of by the built in functionality of the bidi algorithm, so overrides should very rarely be necessary. You would want to override the direction of arabic right to left text just as often as you would want to write an English sentence backwards.

Reordering

When the levels are calculated, the segments are split up into lines, using the ordinary line-breaking algorithm. What you have then is a line, with the information of what bidi-level each character (or stretch) has.

The reordering is now done starting with changing the direction of the highest level characters, then proceeding with changing the next-to highest level characters including any higher level characters. This way every stretch with an even number will appear the same order as they started with (left-to-right), while every stretch with an odd number will have a right-to-left direction. (Reordering specification).

Weak types

There is a class of characters called weak characters. They are typically mathematical expressions such as '+', '/', '-', ':'. These character follow very intricate rules, listed as the weak type rules in the specification.

Description of the implementation

The main class containing the level calculation engine is the BidiCalculation class.

Our implementation does the level calculation in only one iteration, using a state machine with two levels. This adds some complexity to the implementation, but gives us a lot of benefits when it comes to incremental rendering as well as performance.

Initialization of the BidiCalculation is done by giving the object an empty list of ParagraphBidiSegments to fill up.

The main API to the bidi machine is BidiCalculation::AppendStretch(). This function is used to feed the state-machine with new bidi stretches of a certain width and direction.

The calculation class gradually fills up the ParagraphBidiSegment stack given to the engine in the initialization.

Typically, one BidiCalculation object is instatiated for each text paragraph, but the machine can be thrown away when the calculation is done

The actual reordering is not done in this class, but is up to the user of the ParagraphBidiSegment stack, after line breaking is applied.

Memory usage

The output from the BidiCalculation object is a list of ParagraphBidiSegments allocated on the heap. The general rule is that there is one ParagraphBidiSegment for each stretch of text of equal level. (But because of implementation details a stretch of similar level may be represented by more segments.)

The ParagraphBidiSegment class inherits from the Link class and also contains information about how wide the segment is, where it starts on the virtual line, it's calculated bidi-level and a pointer to the HTML_Elements this specific segments starts (used in the layout engine).

The total size of a ParagraphBidiSegment is approximately (depending on system) 32 bytes.

OOM handling

The bidi calculation object returns OpStatus::ERR_NO_MEMORY from BidiCalculation::AppendStretch, if this stretch requires a new segment and the allocation failed. Other functions in the BidiCalculation class that may require segment allocation can also return OpStatus::ERR_NO_MEMORY.

References