Security model for inheritance of character encoding properties
Draft 2
2007-03-02 / peter@opera.com


* Description of the problem

It is common for web pages to not include a description on what character
encoding ("charset" in MIME terminology) is used on a page, and thus the
browser must employ various methods of detecting the encoding of the page
automatically.

One of the methods employed is to allow a parent document to influence
documents and resources contained within or linked from it. This can be done
in various ways:

- The CHARSET attribute on the LINK, A and SCRIPT elements.

- Inheriting the encoding of the parent document into a FRAME or
  IFRAME element.

Allowing one document to influence the character encoding of an unrelated
resource may be an attack vector of XSS attacks, such as the one described
in bug 246974, and below under "Attack vectors".


* Attack vectors

Take two documents, [1] residing on www.evil-phishing.com and entirely under
the control of an Evil Party<tm>, and [2] residing on www.famous-blog.com,
which allows anyone to leave comments, which are sanitized so that any HTML
code is removed. Document [2] does not declare a charset in the header, nor
does it contain any META tag declaring one.

The party controlling [1] posts a comment on [2] with the text
"+ADw-script+AD4alert(document.cookie);+ADw-/script+AD4". This text passes
the sanitizer and is displayed as-is on the comment page.

The page on [1] is now set up to be served as UTF-7, and adds the document
in an IFRAME. If charset inheritance is allowed, any visitor to [1] will see
[2] and the page is now interpreted as containing the sequence
"<script>alert(document.cookie);</script>".

This is the vector decsribed in bug 246947.

Similar attack vectors can possibly be created for other 7-bit character
encodings, such as any of the ISO 2022-based ones or HZ-GB2312.


* Why allow inheritance

** Example one

Take two documents. [1] residing on www.famous-portal.jp and [2] residing on
[2] www.american-adserver.com. Neither declare the character encoding of the
page, but [1] is displayed properly in all current browsers due to the
various workarounds that are employed because they have to work with pages
that do not declare an encoding.

The page [1] now includes an IFRAME or a SCRIPT from [2] to display
advertisements. Since [2] does not declare an encoding, to make sure it is
displayed properly, it must either be autodetected or the encoding be
inherited from [1]. Since [2] usually only contains a very short document or
script, autodetecting often fails.

To display the advertisment IFRAME or SCRIPT correctly, it needs to inherit
the encoding.

This is the problem described in, among others, bug 63807, 185465, 213713.

** Example two

Take one document residing on www.russian-expatriate.com, hosted by an
American web hosting firm. The webmaster of it only has control over the
documents it hosts, not the HTTP headers. To make sure the Russian text is
displayed properly he declares the encoding in a META tag, Opera will not
employ Russian autodetection automatically for the .com domain.

To declare dynamic menus, he puts the strings describing them in a script,
linked from the page. Since he cannot control the HTTP header, he adds a
CHARSET attribute to the script link. The script must honour this charset.

This is the problem described in bug 53747.

He later adds an advertisment script from the [2] page from the previous
example. To make sure the Russian texts are displayed properly, he adds the
CHARSET attribute there as well.

This is the problem described in bug 210236.


* Security policy

The only true solution is to force pages into declaring their encoding. This
will never happen.

To make sure we are not exploitable for cross-domain attacks, we define a
security policy (DOC_SET_PREFERRED_CHARSET) that performs a simple origin
check and only allows inheritance, either implicit or through the CHARSET
attribute, if the protocol, domain and port is the same (this is the same as
the DOM origin check). In addition, we do not allow UTF-7 to be set on a
document.

This is the same approach as taken by Mozilla Foundation in their security
advisory 2007-02. This will break ads served from Asian pages (bugs 119501,
185465, 210236, among others) but will allow local frames (bug 213713).


* Proposed workarounds

A possible fix for the cross-domain issue might be to enable autodetection
for pages inside a frame as if they were served from the top page's domain,
so that any IFRAME, FRAME or SCRIPT on a page served from .jp would have its
autodetection mode set to Japanese, no matter what domain it itself might be
served from.

Another one would be to inherit an autodetection mode based on the parent
document's encoding, so that any IFRAME, FRAME or SCRIPT on a page using a
Japanese character encoding (either explicitly defined or autodetected)
would inherit only the Autodetect Japanese mode, not the actual encoding.


* Other possible measures

UTF-7 autodetection should probably be dropped for HTML documents, as it can
be dangerous. Research is needed to see if there are any actual pages using
UTF-7 on the web, and that do not explicitly say so.


* References

** Opera bugs

"Opera 9 - (i)Frames cross domain charset inheritance"
https://bugs.opera.com/show_bug.cgi?id=246974

"Linked IFRAMEs do not inherit encoding"
https://bugs.opera.com/show_bug.cgi?id=63807

"Displaying in Korean page"
https://bugs.opera.com/show_bug.cgi?id=185465

"Character setting in frameset is not honored for framed page. Wrong
encoding type is used"
https://bugs.opera.com/show_bug.cgi?id=213713

"Missing attr "Charset" in tag "Script""
https://bugs.opera.com/show_bug.cgi?id=53747

"Encoding not inherited properly for autodetected pages"
https://bugs.opera.com/show_bug.cgi?id=119501

"Encoding problems with javascript included content"
https://bugs.opera.com/show_bug.cgi?id=203829

"CHARSET attribute on SCRIPT tag not supported across domains"
https://bugs.opera.com/show_bug.cgi?id=210236

** Other references

Mozilla bug: "(i)Frames that do not specify a charset inherit it from parent
across sites"
https://bugzilla.mozilla.org/show_bug.cgi?id=356280

"Mozilla Foundation Security Advisory 2007-02: Child frame character set
inheritance"
http://www.mozilla.org/security/announce/2007/mfsa2007-02.html


