A partial archive of discourse.wicg.io as of Saturday February 24, 2024.

MHTML Generation and Loading as Implemented in Chrome

jianli
2017-10-04

In Chrome, static snapshots of web pages are captured and saved in a modified MHTML format (RFC 2557), a web page archive format that combines resources together with HTML into a single file. A set of modifications, summarized in this doc, has been made to the existing MHTML generator and loader to improve security and privacy and to ensure the snapshot mimics the original page as closely as possible.

There are ongoing efforts to add better support for creating packages of files for use on the web, as drafted in the Packaging on the Web spec. But it will take a long time to get everything sorted out and agreed upon. So Chrome builds its own support on the existing MHTML spec, with the improvements described in this document.
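
For readers unfamiliar with the format, here is a minimal sketch of what "resources together with HTML in a single file" looks like in practice. It uses Python's standard email package to build a plain multipart/related archive; it is not Chrome's generator and does not include any of Chrome's modifications. The function names and the example URLs are assumptions made for illustration.

```python
import email
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, images):
    """Pack one HTML page and its images into a multipart/related archive.

    `images` is a list of (url, subtype, data) tuples, e.g.
    ("http://example.com/a.png", "png", b"...").
    """
    archive = MIMEMultipart("related", type="text/html")

    # The root part is the HTML document itself.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Every other part is a resource, keyed by its original URL.
    for url, subtype, data in images:
        part = MIMEImage(data, _subtype=subtype)  # base64-encoded by default
        part["Content-Location"] = url
        archive.attach(part)
    return archive.as_bytes()


def part_locations(raw):
    """Return the Content-Location of every non-multipart part."""
    message = email.message_from_bytes(raw)
    return [p["Content-Location"] for p in message.walk() if not p.is_multipart()]
```

Calling part_locations on the result of build_mhtml lists the page URL first, followed by the URL of each embedded resource.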

MT
2017-10-05

What’s the benefit of MHTML over regular HTML with external resources embedded as standard Data URIs?

jianli
2017-10-05

There are a lot of benefits to saving as MHTML:

  • MHTML is self-contained, while regular HTML has no such constraint.
  • Certain browser versions may impose a size limit on data URIs. Also, data URIs are base64 encoded, which is not efficient.
  • It may be hard to encapsulate an iframe within an iframe using data URIs.
  • In Chrome, MHTML is loaded in sandboxing mode and with other constraints, while a regular HTML file is not.

MT
2017-10-05

MHTML is self-contained, while regular HTML has no such constraint.

What exactly do you mean?

Certain browser versions may impose a size limit on data URIs.

Apart from the ancient IE8’s 32 KB limit, are you aware of such a limitation in any specific current browser?

data URIs are base64 encoded, which is not efficient.

Afaict, Data URIs (base64) are used in MHTML as well, e.g. for images.

It may be hard to encapsulate an iframe within an iframe using data URIs.

What exactly do you mean?

In Chrome, MHTML is loaded in sandboxing mode and with other constraints, while a regular HTML file is not.

This could probably be solved (if it needs solving in the first place) by using a special meta element inside the HTML to inform the browser that this specific HTML document should be viewed in sandboxing mode.

jianli
2017-10-07

MHTML is a web page archive format used to combine resources together with HTML into a single file. All the resources needed to render the page should be contained in the MHTML file. If a resource is not found in the file, it should not be loaded from the cache or the network.
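
To make the "never fall back to cache or network" rule concrete, here is a rough sketch of a resolver that only ever answers from the archive. It is an illustration in Python using the standard email parser, not Chrome's loader; the class name, the sample archive, and the URLs are all made up for the example.

```python
import email

# A tiny hand-written archive used only for this illustration.
SAMPLE_MHTML = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Location: http://example.com/\r\n"
    b"\r\n"
    b"<html></html>\r\n"
    b"--b1\r\n"
    b"Content-Type: text/css\r\n"
    b"Content-Location: http://example.com/s.css\r\n"
    b"\r\n"
    b"p{}\r\n"
    b"--b1--\r\n"
)


class ArchiveOnlyResolver:
    """Resolve URLs strictly against the parts stored in one MHTML file."""

    def __init__(self, raw_mhtml):
        message = email.message_from_bytes(raw_mhtml)
        self._parts = {}
        for part in message.walk():
            location = part["Content-Location"]
            if location is not None and not part.is_multipart():
                self._parts[location] = part.get_payload(decode=True)

    def resolve(self, url):
        # A missing resource stays missing: return None instead of
        # consulting a cache or fetching from the network.
        return self._parts.get(url)
```

With the sample above, resolving http://example.com/s.css returns the stylesheet bytes, while any URL that is not in the archive returns None.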

The binary content-transfer encoding can be used for each included resource. This is what we currently do for MHTML saving in Chrome on Android. We’re considering doing this on all other platforms as well.
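
As a rough illustration of why the encoding matters for file size, the sketch below compares the size of a payload stored as raw bytes (as with the binary content-transfer encoding) against the same payload in base64. It is plain Python with a hypothetical helper name, and it says nothing about how Chrome measures or writes its archives.

```python
import base64


def transfer_sizes(payload):
    """Return the stored size of `payload` under each encoding, in bytes."""
    return {
        # Binary content-transfer encoding stores the bytes unchanged.
        "binary": len(payload),
        # Base64 turns every 3 input bytes into 4 output characters.
        "base64": len(base64.b64encode(payload)),
        # MIME-style base64 also adds a newline after every 76 characters.
        "base64_mime": len(base64.encodebytes(payload)),
    }
```

For a 3000-byte resource this gives 3000 bytes in binary and 4000 in base64, so base64 costs about one third more before any line breaks are counted.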

If the original page contains multiple nested iframes, it will be a bit of work to try to construct all data URIs correctly. It is also a great pain if we want to debug what’s going on.
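
To give a feel for the nesting problem, here is a small sketch of inlining a chain of iframes with data URIs. Each document has to be encoded into the src of its parent, so the innermost frame ends up base64-encoded once per level and is unreadable without decoding layer by layer. The `{child}` placeholder and the function names are assumptions for this example only.

```python
import base64

PREFIX = "data:text/html;base64,"


def to_data_uri(html):
    """Encode an HTML string as a base64 data URI."""
    return PREFIX + base64.b64encode(html.encode("utf-8")).decode("ascii")


def from_data_uri(uri):
    """Decode a data URI produced by to_data_uri."""
    if not uri.startswith(PREFIX):
        raise ValueError("not a base64 text/html data URI")
    return base64.b64decode(uri[len(PREFIX):]).decode("utf-8")


def inline_nested_frames(documents):
    """Inline a chain of documents, outermost first.

    Each document may contain the placeholder {child}, which is replaced
    by the data URI of the next document in the chain.
    """
    embedded = ""
    page = ""
    # Work from the innermost frame outwards.
    for html in reversed(documents):
        page = html.replace("{child}", embedded)
        embedded = to_data_uri(page)
    return page
```

With three levels, the leaf document is only visible after decoding two data URIs in a row, which is the debugging pain mentioned above.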

justsomeguy
2017-10-11

I would think you’d need to keep elements like <p hidden="">, input, object, video, and script. Those elements can be used for sibling and :empty pseudo-class selector matching. Attributes such as href, onclick, ping, and src can likewise be used for selector matching, and the absence of those attributes can also trigger style changes.

And while everything inside of <p hidden=""> can be safely discarded per the semantics described in the HTML spec, I can foresee people using hacks like pairing <html hidden=""> with display: block, <body><object>Hello, world!</object></body>, or <body><a ping=""/> with body > a:not([ping]) ~ * { display: none; } to prevent their page content being saved with this feature.

Also, it’s sad there’s no audio or video support in the proposal. It’s nice to see some rare interest in MHTML though. I used to save pages with it when I still used Opera as a web browser.

Siemenskun
2017-12-22

Hello, I’m a developer who has used the MHTML format for my own purposes for years, and I have something to say about the implementation of the Snapshot-Content-Location header. I think you can just use the regular Content-Location header in the heading of a multipart/related.

Opera on the Presto engine adds the Content-Location header to the main MHTML header. Many years ago (more than ten for sure), only IE and Opera could create MHTML files, and that was the industry standard. So I think MHTML readers can handle a Content-Location header in the main header.

About https://tools.ietf.org/html/rfc2557#section-4.3

If a Content-Location header field is used in the heading of a multipart/related, this Content-Location SHOULD apply to the whole aggregate, not to its root part.

I think it can mean that, for example, we can have Content-Location: http://example.com/page/subpage in the MHTML header, while the root part’s header is http://example.com/SPAengine.html, if we are dealing with some kind of Single-Page Application that uses the HTML5 History API. (Yes, I know that Blink doesn’t allow JavaScript in MHTML pages, but other browsers do.) A simpler example: Content-Location in the MHTML header is http://example.com/article.html#paragraph5, but the root part’s header is just http://example.com/article.html. By the way, other browsers (I’ve checked IE 11 and Opera 12) strip the URI #anchor from the root Content-Location, which seems logical (Opera also strips it from the header in the top-level headers, which is not so logical), but Blink doesn’t do that.
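
To illustrate the article.html example, here is a small Python sketch of how a writer could keep the full URL for the aggregate while stripping the #anchor for the root part. The function name and the returned keys are hypothetical; only urldefrag is a real standard-library call, and this is not how any of the browsers mentioned actually implement it.

```python
from urllib.parse import urldefrag


def split_locations(snapshot_url):
    """Derive the two Content-Location values for one saved page."""
    root_url, fragment = urldefrag(snapshot_url)
    return {
        # Top-level header: applies to the whole aggregate, anchor kept.
        "aggregate": snapshot_url,
        # Root part header: the document URL without the #anchor.
        "root": root_url,
        "fragment": fragment,
    }
```

For http://example.com/article.html#paragraph5 this keeps the full URL for the aggregate and gives http://example.com/article.html for the root part.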

Sorry for my bad English.