Alternative to WebPackages to guarantee integrity (Document Hash)


#1

Google has been promoting AMP pages since AMP was introduced. AMP has been marketed heavily, and developers are encouraged to build AMP pages. This comes with two caveats, however: the URL shown in the browser is not the page’s real URL, and the internet becomes more centralized.

Google presented the community with “WebPackages” to fix the problem. However, we believe that it is not the right solution for the problem.

The reason Google hosts AMP itself is to guarantee that the page follows the AMP standards and that the website does not serve different code depending on whether the user agent is Googlebot or a real user.

Document Hash comes with the integrity guarantees of WebPackages, but lets the author’s server distribute the page itself while still giving integrity guarantees to the referrer. Document Hash is the SHA256 hash of the DOM elements of the document.
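To make the idea concrete, here is a minimal sketch of how such a hash could be computed. The proposal does not define a canonical serialization of the DOM, so everything about the canonical form below (sorted attributes, trimmed text nodes, ignored HTML comments) is an assumption for illustration, and document_hash is a hypothetical name.

```python
import hashlib
from html.parser import HTMLParser


class _DomSerializer(HTMLParser):
    """Collects an assumed canonical form of tags, attributes, and text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name so attribute order does not affect the hash.
        rendered = []
        for name, value in sorted(attrs, key=lambda pair: pair[0]):
            rendered.append('%s="%s"' % (name, value or ""))
        self.parts.append("<" + " ".join([tag] + rendered) + ">")

    def handle_endtag(self, tag):
        self.parts.append("</" + tag + ">")

    def handle_data(self, data):
        # Surrounding whitespace is dropped (an assumption, not part of the proposal).
        text = data.strip()
        if text:
            self.parts.append(text)

    # HTML comments are ignored: HTMLParser.handle_comment does nothing by default.


def document_hash(html: str) -> str:
    """Return 'SHA256:<hex>' over the assumed canonical serialization."""
    serializer = _DomSerializer()
    serializer.feed(html)
    serializer.close()
    canonical = "\n".join(serializer.parts)
    return "SHA256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this particular canonical form, two documents that differ only in attribute order, surrounding whitespace, or HTML comments hash to the same value, while any change to a tag, attribute value, or text node changes the hash.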

High-level overview of how it works:

A SHA256 hash attribute is added to the <a> element. Example: <a hash="SHA256:d04b98f48e8f8bcc15c6ae5ac050801cd6dcfd428fb5f9e65c4e16e7807340fa">

This means that the page that opens when the <a> element is clicked is guaranteed to have the same document hash. Once the user clicks another link in that document, the hash restriction no longer applies.

Every hash attribute starts with the name of the hashing algorithm.
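A rough sketch of how a user agent might read the attribute and compare it against the fetched document follows. The function names, the set of accepted algorithms, and the choice to treat the prefix and digest as case-insensitive are all assumptions, not part of the proposal.

```python
import hashlib
import hmac
import string
from typing import Tuple

# Assumed set of accepted algorithms; the proposal only shows SHA256.
SUPPORTED_ALGORITHMS = {"sha256", "sha384", "sha512"}


def parse_hash_attribute(value: str) -> Tuple[str, str]:
    """Split 'SHA256:<hex>' into ('sha256', '<hex>')."""
    algorithm, separator, digest = value.strip().partition(":")
    name = algorithm.lower()
    if not separator or name not in SUPPORTED_ALGORITHMS:
        raise ValueError("missing or unsupported algorithm prefix: %r" % value)
    if not digest or any(ch not in string.hexdigits for ch in digest):
        raise ValueError("digest is not hexadecimal: %r" % value)
    return name, digest.lower()


def hash_matches(attribute: str, canonical_document: bytes) -> bool:
    """Check the bytes of an already-canonicalized document against the attribute."""
    name, expected = parse_hash_attribute(attribute)
    actual = hashlib.new(name, canonical_document).hexdigest()
    return hmac.compare_digest(actual, expected)
```

Carrying the algorithm name in the value means a later algorithm could be introduced without changing the attribute itself.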

How should a hash mismatch be handled? A fail attribute can be added to <a>. The fail attribute takes one of three values:

  1. return
  2. ignore
  3. inform

Example 1 and behavior: <a href="http://example.com" hash="SHA256:d04b98f48e8f8bcc15c6ae5ac050801cd6dcfd428fb5f9e65c4e16e7807340fa" fail="return" >

What happens when example.com serves different content:

  1. User clicks on the element from google.com
  2. Browser attempts to download the example.com document
  3. The hashes are not the same, hence the hash check fails
  4. User is automatically redirected back to google.com

Example 2 and behavior: <a href="http://example.com" hash="SHA256:d04b98f48e8f8bcc15c6ae5ac050801cd6dcfd428fb5f9e65c4e16e7807340fa" fail="ignore" >

What happens when example.com serves different content:

  1. User clicks on the element from google.com
  2. Browser attempts to download the example.com document
  3. The hashes are not the same, hence the hash check fails
  4. User stays on example.com and nothing happens

Example 3 and behavior: <a href="http://example.com" hash="SHA256:d04b98f48e8f8bcc15c6ae5ac050801cd6dcfd428fb5f9e65c4e16e7807340fa" fail="inform:https://google.com/fail?=blablabla" >

What happens when example.com serves different content:

  1. User clicks on the element from google.com
  2. Browser attempts to download the example.com document
  3. The hashes are not the same, hence the hash check fails
  4. User stays on example.com and the browser informs google.com that the hash did not match.
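The three behaviors above could be sketched as a single decision function. FailOutcome and handle_hash_mismatch are hypothetical names, and how the report would actually be sent (method, payload) is left open here because the proposal does not specify it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailOutcome:
    navigate_to: str                   # URL the user ends up on
    report_url: Optional[str] = None   # endpoint to notify, if any


def handle_hash_mismatch(fail: str, referrer_url: str, target_url: str) -> FailOutcome:
    """Decide what happens after a failed hash check, based on the fail attribute."""
    # Split only on the first colon so the report URL keeps its own colons.
    mode, _, argument = fail.partition(":")
    if mode == "return":
        return FailOutcome(navigate_to=referrer_url)
    if mode == "ignore":
        return FailOutcome(navigate_to=target_url)
    if mode == "inform":
        if not argument:
            raise ValueError("inform requires a report URL")
        return FailOutcome(navigate_to=target_url, report_url=argument)
    raise ValueError("unknown fail mode: %r" % fail)
```

For example 3, handle_hash_mismatch("inform:https://google.com/fail?=blablabla", "https://google.com/", "http://example.com") keeps the user on example.com and yields the google.com URL as the report endpoint.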

What about canvas and iframe elements?

Nothing.

Document Hash means that the DOM nodes and the document’s content are hashed. However, the contents of canvas and iframe elements are not part of the hash, so they can be dynamic even when the parent document’s hash stays the same.
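As an illustration, a hashing pass could simply skip everything nested inside those elements. This is only a sketch under assumptions: the serialization format is invented for the example, and it assumes the canvas and iframe tags themselves (including attributes such as src) still count toward the hash, which the proposal does not spell out.

```python
import hashlib
from html.parser import HTMLParser

OPAQUE_TAGS = {"canvas", "iframe"}  # contents of these are left out of the hash


class _SkippingSerializer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.opaque_depth = 0  # > 0 while inside a canvas or iframe

    def handle_starttag(self, tag, attrs):
        if self.opaque_depth == 0:
            rendered = ['%s="%s"' % (n, v or "") for n, v in sorted(attrs, key=lambda p: p[0])]
            self.parts.append("<" + " ".join([tag] + rendered) + ">")
        if tag in OPAQUE_TAGS:
            self.opaque_depth += 1

    def handle_endtag(self, tag):
        if tag in OPAQUE_TAGS and self.opaque_depth > 0:
            self.opaque_depth -= 1
        if self.opaque_depth == 0:
            self.parts.append("</" + tag + ">")

    def handle_data(self, data):
        if self.opaque_depth == 0 and data.strip():
            self.parts.append(data.strip())


def document_hash_skipping_opaque(html: str) -> str:
    serializer = _SkippingSerializer()
    serializer.feed(html)
    serializer.close()
    canonical = "\n".join(serializer.parts)
    return "SHA256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under these assumptions, changing what sits inside a canvas or iframe leaves the parent hash untouched, while changing the surrounding text or the iframe's src does not.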

What about external scripts?

Nothing.

Google can require external script elements to have the integrity attribute, which is part of the DOM and hence part of the document hash.

An unexpected benefit if external scripts have a hash attribute: cache-based tracking would be reduced.
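A check like this could be sketched as follows: list every external script (one with a src attribute) that lacks an integrity attribute. The function name is hypothetical, and the sketch only checks that the attribute is present, not that its value is a valid SRI digest.

```python
from html.parser import HTMLParser
from typing import List


class _ScriptAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing = []  # src values of external scripts without integrity

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        # Inline scripts have no src; their text is already part of the document.
        if src and not attributes.get("integrity"):
            self.missing.append(src)


def scripts_missing_integrity(html: str) -> List[str]:
    audit = _ScriptAudit()
    audit.feed(html)
    audit.close()
    return audit.missing
```

A crawler could run such a check when it computes the hash and refuse to treat the page as verified if the returned list is non-empty.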


#2

What you’re proposing sounds a lot like an SRI equivalent for external links. That may be useful.

However, for the AMP case, I believe that won’t be sufficient, as it doesn’t allow the embedder (e.g. Google Search) to privately prefetch the content without leaking potentially-private user information to the publisher.


#3

In addition to cryptographic hashes indicated on hyperlinks, we can (see also: .NET assembly linking and strongly named assemblies) consider the titles, versions, cultures/languages and the public keys or public key tokens, if linked-to resources are digitally signed.

We can, more broadly, consider indicating the document metadata of linked-to resources in the linking resources’ hyperlinks.

In another thread, we’re discussing document element metadata. Consider that we could attach to a hyperlink the expected metadata of a linked-to resource.

<metadata id="resource1">
  <meta name="title" content="Example Resource" />
  <meta name="version" content="1.0.0.0" />
  <meta name="language" content="en-US" />
  ...
</metadata>

<a href="http://example.com/resource.html" metadata="resource1" />
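As a rough sketch of what a check against such expected metadata might look like, the function below compares an expected name/content mapping with the meta elements found in the fetched resource. The function name and the idea that the linked resource exposes its metadata as ordinary meta elements are assumptions for illustration only.

```python
from html.parser import HTMLParser
from typing import Dict, Optional, Tuple


class _MetaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}  # meta name -> content

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def metadata_mismatches(expected: Dict[str, str], html: str) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (expected, found)} for every expected entry that differs."""
    collector = _MetaCollector()
    collector.feed(html)
    collector.close()
    return {
        name: (want, collector.meta.get(name))
        for name, want in expected.items()
        if collector.meta.get(name) != want
    }
```

An empty result would mean the linked-to resource carries all the metadata the hyperlink expected.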

#4

What you’re proposing sounds a lot like an SRI equivalent for external links. That may be useful.

However, for the AMP case, I believe that won’t be sufficient, as it doesn’t allow the embedder (e.g. Google Search) to privately prefetch the content without leaking potentially-private user information to the publisher.

If I do not want Google to host AMP but want to assure Google that the content did not change, this might actually be a good idea. Yes, what I am proposing is more or less SRI for external links.

But part of the problem is that Google cannot be sure that the code sent to Googlebot is the same as the code sent to a user. I do not want Google to prerender or prefetch my pages or web app. Document Hash would help Google penalize slow pages without requiring AMP. AMP restricts JS, but JS is not bad if used properly. AMP has now proposed WorkerDOM for JS in AMP, but it is not good enough and has its own limits on the size of JS and CSS. Responsible developers and authors can use standardized performance tools to build their apps. Right now, Google is punishing good websites with fast JS because of AMP.

WebPackages allows Google to host the page. What if I do not want Google to host it but still want Google to be sure of the page’s integrity? This is the answer.

This is the midway point between AMP and the open web. Most people do not want Google to host their content. Google wants performance assurances. What if I don’t want Google to prefetch the content?


#5

In the case of fail="return" how does the browser notify the user why they were redirected back?

What about sites that use routine static caching that change comments on the page for the cache info? The comment changes offset the hash regularly, making them constantly change.

What about the case of a simple site redesign or tweak? User generated content may also change the hash of a page’s content, so how should that be handled? Having a static hash in a link then breaks that link going forward unless the resource is something that is never changing. Unless you tell the link to fail in ignore mode, but then what is the point of the hash?

return mode seems iffy in terms of user experience and possible to trigger way too often with dynamically generated sites.

ignore is just our current web, no issue there.

inform is possibly useful, but because the web changes constantly and dynamically it is prone to cause more noise than solutions. inform mode would also have to be restricted so that it can only send a report to the current origin (or to white-listed ones declared through a header), since otherwise it opens the door to DDoS attacks using unsuspecting end users.
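Such a restriction could be sketched roughly like this. It assumes that "current origin" means the origin of the page carrying the link and that the white-list has already been read from some header; the function names and the header mechanism are hypothetical.

```python
from typing import Iterable, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> Tuple[str, str, int]:
    """Return (scheme, host, port) for an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError("not an http(s) URL with a host: %r" % url)
    return scheme, parts.hostname, parts.port or DEFAULT_PORTS[scheme]


def may_send_report(report_url: str, linking_page_url: str,
                    allowed_origins: Iterable[str] = ()) -> bool:
    """Allow a report only to the linking page's origin or a white-listed origin."""
    target = origin_of(report_url)
    if target == origin_of(linking_page_url):
        return True
    return any(target == origin_of(origin) for origin in allowed_origins)
```

With this check, a link on google.com could report to https://google.com/fail but not to an arbitrary third-party host unless that host's origin had been white-listed.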

Overall, I don’t see where this fundamentally helps the web ecosystem.

If you can provide clear examples of this then that would be fantastic. Because, in theory if everything between two sites is equal in terms of overall search ranking quality and page performance, they are equal. And if the normal site is faster than the AMP site then it should have the higher ranking. So if there is a clear example of where this is not the case, it would be useful to see so it can be forwarded along to the proper teams.

You need to define the context of prefetching here. If you mean Googlebot fetching it, you block Googlebot. If you mean the User Agent (Chrome, Firefox, etc. Anyone implementing the prefetch/prerender stuff) then that’s not your choice as a website author. Those are UA features to help improve the end user experience. End users have the control to turn that off in their settings.


#6

In the case of fail="return" how does the browser notify the user why they were redirected back?

A “fail” event is fired.

What about sites that use routine static caching that change comments on the page for the cache info? The comment changes offset the hash regularly, making them constantly change.

It should be in an iframe. Alternatively, a website could decide not to show comments on the page with restrictions and just add a button that takes users to the next page. If they have a service worker, it should take 200 ms.

What about the case of a simple site redesign or tweak? User generated content may also change the hash of a page’s content, so how should that be handled? Having a static hash in a link then breaks that link going forward unless the resource is something that is never changing. Unless you tell the link to fail in ignore mode, but then what is the point of the hash?

The point of the hash is to hash the text of the DOM nodes, not the CSS states or the design. Sites with user-generated content could embed an iframe for that content, or alternatively draw it on a canvas.

Alternatively, we could have an HTML tag like <dynamic> whose content would be exempt from the restrictions.

return mode seems iffy in terms of user experience and possible to trigger way too often with dynamically generated sites.

When implemented properly, this should not be the case. You also have the option to inform the referrer, so Google can take action for future visitors.

If you can provide clear examples of this then that would be fantastic. Because, in theory if everything between two sites is equal in terms of overall search ranking quality and page performance, they are equal. And if the normal site is faster than the AMP site then it should have the higher ranking. So if there is a clear example of where this is not the case, it would be useful to see so it can be forwarded along to the proper teams.

News carousels are exclusive to AMP, even if the original site is properly optimized.

You need to define the context of prefetching here. If you mean Googlebot fetching it, you block Googlebot. If you mean the User Agent (Chrome, Firefox, etc. Anyone implementing the prefetch/prerender stuff) then that’s not your choice as a website author. Those are UA features to help improve the end user experience. End users have the control to turn that off in their settings.

I mean AMP without Google hosting it. I like to host content myself.