HTML Sanitizer API in the browser

Hi folks,

Together with @mikewest and Mario Heiderich (@cure53), we have started looking at how one could specify an HTML Sanitizer that lives in the browser. I could try to summarize all the greatness in here, but I think we’ve done a great job in the Sanitization explainer already :slight_smile:

Please let us know what you think. I’m thrilled to hear more!


Hello there :slight_smile: I signed up and look forward to your feedback as well!

To start with something:


I like your proposal but it doesn’t give me a feeling that this is something we can do without tremendous effort.

I wonder if it makes sense to overshoot and thereby risk losing focus on the actual goals. Do we want to build a sanitizer that developers can use, or do we want another complex beast like CSP that ends up being misused and misunderstood, and becomes a sinkhole for everyone’s resources?

The beauty of the proposed Sanitizer API lies in its simplicity and low implementation effort. Your proposals are valuable, but maybe they are something we should tackle in v2.0. WDYT?

I like this idea. Is there any reason why we couldn’t just add this “purification” to the APIs that already exist? Like the innerHTML setter? appendChild() etc?

No, we just want to start simple and create a solid foundation. The more fanciness we add now, the harder it will be to build the MVP.

In order for the sanitizer API to be usable, the browsers need to use the lower-level primitives anyway (e.g. they have to have a concrete whitelist, they need an inert DOM). It might be just a matter of exposing those primitives as a web API, which I believe might be simpler to agree on than, e.g., a list of necessary customization options and hooks for the sanitizer, and their actual implementation.

For example, for clients to migrate to a native sanitizer, the native implementation needs to provide a superset of the customization options of all large userland sanitizers (in other words, SAFE_FOR_JQUERY, name policies, URL policies and all other features must be possible to implement via the exposed native hooks). A missing configuration knob would mean that a DOMPurify-ng (backwards compatible & wrapping over the native sanitizer) would have to make a second, legacy sanitization pass.

I’m just wondering whether it wouldn’t actually be simpler to agree on & implement the primitives, which don’t require that many config knobs to be functional (and so far, over the years, the client-side sanitizers have kept adding more config settings, suggesting that the devs actually need them). Lots of those primitives are already there and just need some finishing touches (e.g. the elem/attr list is implemented in browsers due to existing sanitizers, inert DOM APIs exist but are inconsistent, DOMParser exists etc.). This task might be simply less bikesheddy, and more tangible to implement, than a full sanitizer.
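As a rough illustration of what building on those primitives could look like in userland today: the sketch below parses into an inert document with DOMParser and then filters against an allow-list. The two lists and the function names are made up for this example, and it deliberately has no URL policy, which is exactly the kind of knob that would still be missing.

```javascript
// Hypothetical allow-lists for illustration; a real list would be much longer.
const ALLOWED_ELEMENTS = new Set(["a", "b", "em", "i", "p", "strong"]);
const ALLOWED_ATTRIBUTES = new Set(["href", "title"]);

// Pure checks, usable without a DOM.
function isAllowedElement(localName) {
  return ALLOWED_ELEMENTS.has(localName.toLowerCase());
}

function isAllowedAttribute(name) {
  return ALLOWED_ATTRIBUTES.has(name.toLowerCase());
}

// Needs a browser: DOMParser is not available in plain Node.
function sanitizeViaPrimitives(html) {
  // A document created by parseFromString is inert: its scripts do not run.
  const doc = new DOMParser().parseFromString(html, "text/html");
  for (const element of doc.body.querySelectorAll("*")) {
    if (!isAllowedElement(element.localName)) {
      element.remove();
      continue;
    }
    // Copy the attribute list first, because we modify it while iterating.
    for (const { name } of Array.from(element.attributes)) {
      if (!isAllowedAttribute(name)) element.removeAttribute(name);
    }
  }
  // Return nodes rather than a string to avoid a second parse.
  // Note: href values are NOT checked here (no URL policy).
  return doc.body;
}
```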

Sadly we don’t have any metrics to check who uses what config flag and why. We’d have to crawl GitHub and the like to learn more about the actual usage.

From the more or less anecdotal data we gather in pen-tests, we can see that the majority of websites/apps/extensions just use the default. But that’s not really good data anyway, just mentioning.

Possibly a dumb question, but what’s the problem with the following:

function sanitizeAsFragment(html) {
  const sanitizingSelector = "script,style"; // for example
  const fragment = document.createRange().createContextualFragment(html);
  for (const element of fragment.querySelectorAll(sanitizingSelector)) {
    element.remove(); // drop every element matching the selector
  }
  return fragment;
}

const userGeneratedHtml = "<script>alert('Hello buddies')</script><a href=''>Example</a>";

Technically nothing at all, it’s just very limited and too simple to address XSS.

Let’s say you want to remove all anchors but only if the href has a JavaScript URI. Then you also wanna remove all event handlers from all elements. Then you wanna take care of meta and base tags.
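To make that concrete, here is a rough sketch of just those three extra rules. It assumes a browser DOM for the tree walk; the helper names (isJavaScriptUrl, isEventHandlerName, stripDangerous) are made up for illustration, and this is nowhere near a complete sanitizer.

```javascript
// Hypothetical helpers for illustration only; not a complete sanitizer.

// True if the URL would resolve to the javascript: scheme. URL parsing strips
// leading control characters and spaces, and ignores tabs and newlines.
function isJavaScriptUrl(url) {
  const normalized = url
    .replace(/^[\u0000-\u0020]+/, "")
    .replace(/[\t\n\r]/g, "");
  return /^javascript:/i.test(normalized);
}

// True for inline event handler attributes such as onclick or onerror.
function isEventHandlerName(attributeName) {
  return /^on/i.test(attributeName);
}

// Applies the three rules to an already parsed fragment (needs a browser DOM).
function stripDangerous(root) {
  for (const element of root.querySelectorAll("*")) {
    // Rule 3: meta and base tags go away entirely.
    if (element.matches("meta, base")) {
      element.remove();
      continue;
    }
    // Rule 1: anchors are removed only if the href is a JavaScript URI.
    if (element.matches("a[href]") && isJavaScriptUrl(element.getAttribute("href"))) {
      element.remove();
      continue;
    }
    // Rule 2: event handlers are stripped from all remaining elements.
    for (const { name } of Array.from(element.attributes)) {
      if (isEventHandlerName(name)) element.removeAttribute(name);
    }
  }
  return root;
}
```

Even this small version already needs to know how URLs are normalized, which is the sort of detail that is easy to get wrong.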

Then, the further you go, more and more special cases pop up. And after a while you realize that you just re-implemented DOMPurify :smiley:

Our goal here is to create an API where all those worries can be left behind. And where a developer doesn’t have to trust some random folks (i.e. me) for any sanitization needs.


Thanks for your explanation, I forgot about those onxxx event handlers and TIL base tag.

So sanitizing everything can get pretty hard in some situations. Perhaps we could introduce a safety zone instead?

  <script>alert("This won't run in a safety zone");</script>
  <a href="javascript:console.log('This also will not run')"></a>
  <a onclick="console.log('because')">A safety zone won't allow running any JavaScript code</a>
  <base href="http://nor-anything-that-affect-global.document" />
  <style>.including-styling { display: none; }</style>
  <link href="" rel="stylesheet" />

I think “jails” were discussed way earlier and are technically a different topic. Here, we discuss mostly the sanitization approach.

The thing is, we already know what needs to be sanitized and how, even the crazier cases like template expressions, data-dash attributes, class attributes depending on the framework in use, id and name, etc. None of this is new. The idea is basically to take all the learnings and put them where they imho belong: the browser :slight_smile:


It might be useful in the explainer or in other documentation to describe what “we know already […] needs to be sanitized”. I had to get to the very end of the README before I concluded that it was just a proposal to make content static, with no executable JavaScript, and apparently the main threat model is within-page XSS. I would suggest that “Wait, what does secure even mean in this context?” be answered, very specifically, within the first couple paragraphs.

Are data-dash or class attributes unsafe because a site’s particular framework might take actions based on them? That seems especially site- and framework-specific, and I’m not sure it makes sense to code that into a browser. If a new framework becomes popular that uses something different to trigger JavaScript or other logic, will we have to update the browser sanitization API?

We believe we have mentioned this already, check here:

Would you say we should position this more prominently?

As for the framework-craziness causing formerly harmless attributes to be dangerous: The browser wouldn’t have to update their code, the developer would have to update the config, that is all. This is no different to what we have today with JavaScript-written sanitizer libraries.

The browser can offer what a JavaScript library such as DOMPurify offers as well: a safe default, unless you have a nut-job JS framework in use. If that is the case, you have to re-configure the whitelist anyway :slight_smile: And with data-dash being blocked, we would catch a lot already.
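A small sketch of that split between a safe default and a per-site config. The config shape and names here are hypothetical, chosen only for illustration; they are neither the proposed API nor DOMPurify’s actual options.

```javascript
// Hypothetical config shape for illustration only.
const defaultConfig = {
  allowDataAttributes: false, // data-dash blocked by default
  extraBlockedAttributes: [], // nothing else blocked beyond the built-in rules
};

// Decides whether an attribute survives under a given config.
function isAttributeAllowed(name, config = defaultConfig) {
  const lower = name.toLowerCase();
  if (lower.startsWith("data-") && !config.allowDataAttributes) return false;
  return !config.extraBlockedAttributes.includes(lower);
}

// A site whose framework evaluates class values tightens its own config;
// the browser itself does not need to change.
const frameworkSiteConfig = {
  ...defaultConfig,
  extraBlockedAttributes: ["class"],
};

// A site that wants to keep data attributes opts back in explicitly.
const dataAttributeSiteConfig = { ...defaultConfig, allowDataAttributes: true };
```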

And let’s keep in mind, we cannot ever offer perfection in this realm, only increasing quality that serves the majority. There will always be edge cases of some sort, e.g. XSS from class attributes.

Yeah, that’s what I mean. I read that exact README, but I didn’t really understand the solution until I got to the final question in the FAQ. As it is, it might be assuming a lot more familiarity with the existing third-party libraries than readers have.

Sure. It’s kind of a pity that data attributes would be in the default blocklist though, as copying and pasting in HTML with marked-up microdata seems like exactly the kind of use case we’d want to encourage. (That’s not necessarily a critique of the proposal, just a lament that data attributes are perhaps most often used not for their intended purpose but to embed behavior within markup.)

I would then propose to add a short “In a Nutshell” section at the top of the explainer document, quickly explaining the problem, where we are knowledge-wise, and the solution in like four bullet points.

Makes sense and fixes the issue?
