Detecting paragraph rectangles in a document

About This Document

Disclaimer

The material, including but not limited to all software, design, drawings, technical specifications and other confidential information sent to you ("Material"), is the exclusive property of Opera Software ASA.
The Material is classified as strictly confidential information and is internationally protected by copyright-, trademark- and other such laws. The Material is sent to you for internal use only and shall only be used as expressly instructed by Opera Software.

Any copying, reproduction, modification or distribution not in accordance with a special written license agreement with Opera Software is expressly prohibited, and may result in severe civil and criminal penalties. Opera Software actively and aggressively enforces its intellectual property rights to the fullest extent of the law.

Changelog

Date | Version | Status | Changes & comments | By
20090415 | 0.1 | Draft | First version | Magnus Gasslander
20090616 | 0.2 | Draft | Updated after input from Öyvinds | Magnus Gasslander
20120217 | 0.3 | Draft | Increased TCTO_LINE_DIFF_THRESHOLD and TCTO_LINE_DIFF_VERTICAL_LIMIT values | Christian Kindahl

Detecting paragraph rectangles

Purpose

The purpose of this document is to describe the behaviour when detecting paragraph rectangles in a document. The paragraph rectangles can be used as hints when scrolling with gravity and for adaptive zoom.

Introduction

This documentation is based on detecting paragraph rectangles using the TextContainerTraversalObject.

Patch bug CORE-18119 also contains an implementation for detecting interesting areas based on the ZoomTraversalObject, inspired by the AreaOfInterest callback. This provides similar functionality.

Paragraph rectangles are available through the OpViewPortController::GetParagraphRects API.

Constants

The following constants are used.

Name | Value
TCTO_LINE_DIFF_THRESHOLD | 160
TCTO_LINE_DIFF_VERTICAL_LIMIT | 160
TCTO_IMPORTANT_CONTENT_HORIZONTAL_THRESHOLD | 60
TCTO_IMPORTANT_CONTENT_VERTICAL_THRESHOLD | 60
TCTO_TITLE_HORIZONTAL_THRESHOLD | 40
TCTO_TITLE_VERTICAL_THRESHOLD | 20

Definitions

Pending block
When text content or replaced content is encountered that is not suitable for a paragraph rect by itself, but may be part of a paragraph, it is added to a pending rectangle. Whenever a block or table cell ends, this rectangle is either committed as a paragraph rectangle or discarded.

Specification

A paragraph rectangle is a rectangle including content that is deemed to be a piece of standalone content in the page. It may contain replaced content or a paragraph of text.

A paragraph rectangle will never span several blocks or table cells.

The TextContainerTraversalObject will create paragraph rectangles around the following content, in order of priority.

Form content

Any form content will create a paragraph rect. Form content consists of <input> (except hidden and image inputs), <textarea>, <button>, <option> and <select>.
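The element test in this rule can be sketched as follows. This is only an illustration: the function name IsFormContent and the idea of passing the tag name and type attribute as lowercase strings are assumptions made for the example, not the interface used by the TextContainerTraversalObject.

```cpp
#include <string>

// Hypothetical helper, not part of the actual traversal code.
// `tag` is the lowercase element name. `input_type` is the lowercase value of
// the type attribute and is only consulted for <input> elements.
bool IsFormContent(const std::string& tag, const std::string& input_type = "")
{
    if (tag == "input")
        return input_type != "hidden" && input_type != "image";

    return tag == "textarea" || tag == "button" ||
           tag == "option" || tag == "select";
}
```

With this sketch, a text input or a <select> qualifies, while a hidden input, an image input or a <div> does not.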

List items

All list items will create a paragraph rect. If the list item has a list-style-type other than none, the area will include the bullet.

Other replaced content

All replaced content with height > TCTO_IMPORTANT_CONTENT_VERTICAL_THRESHOLD and width > TCTO_IMPORTANT_CONTENT_HORIZONTAL_THRESHOLD will create a paragraph rect. Heights and widths include the border.
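A minimal sketch of this size test, using the values from the Constants table. The ReplacedBox struct and the function name are hypothetical and exist only for this example. The sketch assumes the border is added on every side before comparing.

```cpp
// Values taken from the Constants table.
const int TCTO_IMPORTANT_CONTENT_HORIZONTAL_THRESHOLD = 60;
const int TCTO_IMPORTANT_CONTENT_VERTICAL_THRESHOLD = 60;

// Hypothetical description of a replaced element (for example an image).
struct ReplacedBox
{
    int content_width;
    int content_height;
    int border_left;
    int border_right;
    int border_top;
    int border_bottom;
};

// Returns true when the box, measured including its border, exceeds both
// thresholds. Both comparisons are strict, as in the rule above.
bool IsImportantReplacedContent(const ReplacedBox& box)
{
    const int width = box.content_width + box.border_left + box.border_right;
    const int height = box.content_height + box.border_top + box.border_bottom;

    return height > TCTO_IMPORTANT_CONTENT_VERTICAL_THRESHOLD &&
           width > TCTO_IMPORTANT_CONTENT_HORIZONTAL_THRESHOLD;
}
```

Under these assumptions, a 58 by 58 image with a border of 2 on each side qualifies (62 by 62), while a borderless 60 by 60 image does not, because the comparison is strict.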

Large text

A text block that has height > TCTO_TITLE_VERTICAL_THRESHOLD and width > TCTO_TITLE_HORIZONTAL_THRESHOLD will create a paragraph rect regardless of number of words in the block.

Block of text

A block or table cell containing text with more than three words will create a paragraph rect when the block or table cell ends. FIXME: this rule needs an update to support, for example, CJK languages.
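The two text rules (large text and block of text) could be combined roughly as shown below. This is a sketch under stated assumptions: the function names are hypothetical, and words are counted by splitting on whitespace, which is exactly the kind of counting that fails for CJK text as noted in the FIXME.

```cpp
#include <sstream>
#include <string>

// Values taken from the Constants table.
const int TCTO_TITLE_HORIZONTAL_THRESHOLD = 40;
const int TCTO_TITLE_VERTICAL_THRESHOLD = 20;

// Hypothetical word counter: counts whitespace-separated tokens.
// This is an assumption for the example and does not handle CJK text.
int CountWords(const std::string& text)
{
    std::istringstream stream(text);
    std::string word;
    int count = 0;
    while (stream >> word)
        ++count;
    return count;
}

// Hypothetical decision made when a block or table cell ends.
bool TextBlockCreatesParagraphRect(int width, int height, const std::string& text)
{
    // Large text: qualifies regardless of the number of words.
    if (height > TCTO_TITLE_VERTICAL_THRESHOLD &&
        width > TCTO_TITLE_HORIZONTAL_THRESHOLD)
        return true;

    // Block of text: needs more than three words.
    return CountWords(text) > 3;
}
```

For example, a two-word heading measuring 200 by 30 qualifies through the large text rule, while a small "Read more" link measuring 50 by 12 does not qualify through either rule.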

Text around floats

If the starting point of a line differs by more than TCTO_LINE_DIFF_THRESHOLD from the current leftmost x-coordinate of the pending block, and the height of the current pending text block is larger than TCTO_LINE_DIFF_VERTICAL_LIMIT, the pending block will create a paragraph rect and a new rect will be started. The idea of this rule is to capture floats smaller than a certain threshold in a single rectangle together with the surrounding text. Text wrapping around larger floats will create several rectangles.
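The split rule can be illustrated with a small pending-block model. Everything here apart from the two constants is hypothetical: the Rect struct, the PendingTextBlock class and its methods are invented for the example. As a simplification, EndBlock always commits the pending rectangle, whereas the real traversal may discard it.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Values taken from the Constants table.
const int TCTO_LINE_DIFF_THRESHOLD = 160;
const int TCTO_LINE_DIFF_VERTICAL_LIMIT = 160;

// Hypothetical rectangle type in document coordinates.
struct Rect { int left; int top; int right; int bottom; };

// Hypothetical model of the pending block for lines of text.
class PendingTextBlock
{
public:
    // Adds one line box. If the line triggers the float rule, the pending
    // rectangle is appended to `committed` and a new one starts at this line.
    void AddLine(const Rect& line, std::vector<Rect>& committed)
    {
        if (has_pending_ && StartsNewRect(line))
        {
            committed.push_back(pending_);
            has_pending_ = false;
        }
        if (!has_pending_)
        {
            pending_ = line;
            has_pending_ = true;
            return;
        }
        // Otherwise grow the pending rectangle to include the line.
        pending_.left = std::min(pending_.left, line.left);
        pending_.top = std::min(pending_.top, line.top);
        pending_.right = std::max(pending_.right, line.right);
        pending_.bottom = std::max(pending_.bottom, line.bottom);
    }

    // Simplification: always commits what is pending when the block ends.
    void EndBlock(std::vector<Rect>& committed)
    {
        if (has_pending_)
            committed.push_back(pending_);
        has_pending_ = false;
    }

private:
    bool StartsNewRect(const Rect& line) const
    {
        const int start_diff = std::abs(line.left - pending_.left);
        const int pending_height = pending_.bottom - pending_.top;
        return start_diff > TCTO_LINE_DIFF_THRESHOLD &&
               pending_height > TCTO_LINE_DIFF_VERTICAL_LIMIT;
    }

    Rect pending_ = {0, 0, 0, 0};
    bool has_pending_ = false;
};
```

In this model, ten lines of height 20 that start at x = 200 beside a float, followed by a line starting at x = 0, produce two rectangles. Three such lines beside a float only 100 wide stay in one rectangle, because neither threshold is exceeded.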

Potential improvements