Proposal to advertise UA automation


#1

Hi,

I’d like to discuss benefits of advertising user agent automation. I started the topic at webappsec (https://lists.w3.org/Archives/Public/public-webappsec/2017Jan/0004.html), but this mailing list seems to be more suitable for the topic.

The idea is to attach an HTTP request header to navigation requests that are initiated by automation tools, by which I mean headless browsers, web driver driven browsers, etc.

The benefit for the website operator is having a choice in how to respond to such requests. For example, do not serve ads, suggest using an API rather than loading heavy resources, send the request through the failed-CAPTCHA route, etc.

This approach overlaps somewhat with robots.txt, but none of the modern UA automation tools honor robots.txt, and implementing the advertising flag seems relatively easy.
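As a purely illustrative sketch of the idea, a server could branch on the presence of such a header. The header name `X-Automated` is an invented placeholder here, not anything standardised:

```python
def handle_navigation_request(headers):
    """Branch on a hypothetical automation-advertising request header.

    'X-Automated' is a placeholder name for illustration; no such
    header has been standardised.
    """
    if "X-Automated" in headers:
        # Lightweight variant: no ads, plus a pointer to a structured API.
        return "Automated client detected; consider the JSON API at /api"
    # Regular page, ads and all.
    return "Full page with ads"

# An automated UA advertising itself gets the lightweight variant:
print(handle_navigation_request({"X-Automated": "true"}))
```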


#2

I’d like to hear more information on the actual use cases proposed and to understand how they avoid harming the web.

  • Do not serve ads
    • If this were implemented via the header, anyone could build an extension that adds the header to all their requests to avoid getting ads, essentially using the header as an ad blocker wherever it is supported. That hurts sites that need to generate revenue to support their business, and it doesn’t help the businesses that would implement it either. I don’t see a benefit coming from this point.
  • suggest using an API rather than loading heavy resources
    • How would this “suggestion” work exactly?
    • Also remember that headless mode is not necessarily web-scraping usage. There could be accessibility tools that take advantage of it, and you wouldn’t want to “suggest” they use APIs to get data when those tools are built to interpret page content.
  • send through failed CAPTCHA route
    • This ends up in the same boat as not serving ads. If a site developer builds things so that one header bypasses security and authorization mechanisms, then anyone wanting to attack the site would only need to send the header, regardless of whether it accurately describes what is happening. This, once again, harms sites more than it helps them.

Before advocating that vendors add such a header, we should have solid examples of how it would be applied in the real world, providing benefits to end users without opening the door to detrimental effects for web authors. I can’t think of a scenario where the web author would know better than the end user what that user wants or needs from a headless browser.


#3

Copying from https://lists.w3.org/Archives/Public/public-wicg/2017Jan/0004.html (From: Nottingham, Mark, mnotting@akamai.com):

Hi,

I didn’t see the thread on discourse, so I’ll respond here.

One thing to keep in mind is that sites which actually do send different content will also need to send a Vary header if the content is cacheable; otherwise, a cache on the same path could serve “headless” content to a browser, or vice versa. E.g., if a reverse proxy or CDN is being used.

On its own that’s not a huge deal, but it will inflate the size of responses a bit; Vary needs to be sent on all responses for a resource it applies to, including the “default” ones (i.e., responses to requests that don’t carry this header).
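A minimal sketch of that caching point, again assuming a hypothetical `X-Automated` request header:

```python
def build_response(request_headers):
    """If content varies on a (hypothetical) X-Automated request header,
    every cacheable response for the resource must carry
    'Vary: X-Automated' -- including responses to requests that did NOT
    send the header -- so shared caches key their entries on it.
    """
    automated = "X-Automated" in request_headers
    body = "headless variant" if automated else "default variant"
    response_headers = {
        "Content-Type": "text/plain",
        "Cache-Control": "public, max-age=60",
        # Sent unconditionally; otherwise a CDN could cache the default
        # variant and serve it to automated clients, or vice versa.
        "Vary": "X-Automated",
    }
    return response_headers, body

# Even the "default" response names the header in Vary:
hdrs, body = build_response({})
print(hdrs["Vary"], "/", body)
```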

Aside from that, I wonder how many headless agents will actually use this, since their typical use is to get whatever the browser does, or as close to it as possible.

Cheers,

Mark Nottingham mnot@akamai.com https://www.mnot.net/


#4

Hi, I’m one of the authors of the WebDriver standard. If browsers are under automation, they are required to populate the Navigator object with a boolean attribute, navigator.webdriver, so that content JS can take appropriate measures.

Adding a header to each HTTP request made whilst the browser is under automation immediately strikes me as a bit heavyweight, and it would require existing implementations to integrate at a deeper level with the network stack, since HTTP requests are not usually made explicitly by the tools.

WebDriver tries to emulate user interaction, and HTTP requests are only implicitly initiated through synthesising trusted click events, keyboard input, &c.


#5

I’d like to hear more information on the actual use cases proposed and to understand how they avoid harming the web.

Folks already have the means to hide JavaScript object properties that some automation tools expose, change the user-agent string, or randomize timing between user events to hide the presence of automation. Nothing stops them from doing the opposite: exposing fake object properties or changing the user-agent string to PhantomJS’s. So the header doesn’t add any new possibilities; it just makes things more convenient for both sides.

The presence or absence of this header should not be interpreted on its own, but rather as one part of a solution to address automation detection.
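As a sketch of that “one signal among several” idea, detection logic might combine the header with other indicators; every signal name and weight below is invented purely for illustration:

```python
def automation_score(signals):
    """Combine several weak automation indicators into one score; the
    header alone is never decisive. Weights are arbitrary and purely
    illustrative, not a recommended detection scheme.
    """
    score = 0.0
    if signals.get("automation_header"):        # the proposed header
        score += 0.5
    if signals.get("webdriver_flag"):           # navigator.webdriver is true
        score += 0.3
    if signals.get("suspicious_event_timing"):  # perfectly regular input events
        score += 0.2
    return score

# The header raises suspicion but is not treated as conclusive on its own:
print(automation_score({"automation_header": True}))
```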


#6

That’s a handy option, although it would require a round trip for the website to read that flag, while a header would deliver it with the initial request.

I agree that exposing the flag on every request is heavyweight and can move things out of the website operator’s control, since resources hosted by third parties would gain knowledge they should not necessarily have. That is why I believe this flag should be exposed only on navigation requests.
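Restricting the flag to navigations could look roughly like the following client-side sketch; the `destination` value mimics the Fetch spec’s notion of a request destination, and both the function and the header name are invented for illustration:

```python
def outgoing_headers(destination, base_headers):
    """Attach the (hypothetical) automation header only to top-level
    navigations (destination 'document'), not to subresource fetches,
    so third-party resources embedded in a page never see it.
    """
    headers = dict(base_headers)
    if destination == "document":
        headers["X-Automated"] = "true"
    return headers

# A navigation carries the flag; an image fetch on the same page does not:
print(outgoing_headers("document", {}))
print(outgoing_headers("image", {}))
```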

Do you happen to know the support timeline for navigator.webdriver? I checked stable Firefox, Chrome, and Safari, and none of them exposes the property yet.


#7

What would prevent these same folks from modifying headless browser binaries to not include said header?


#8

So, if it is just one possible part, assuming it is even accurate, why add it at all? It is an extra header on every request that may or may not get used, and even when it is used, it could be used to degrade web experiences. So far there hasn’t been one solid example of how sites can use this effectively without running the risk of hindering users or themselves.

Use cases are what push things forward. If we don’t have any solid ones, there will never be a reason to persuade vendors to implement what is requested. Let’s focus on use cases first and implementation details second.


#9

Let’s be clear that the security model here is different to that of browser content. If you have access to the browser binary, you can modify any user agent to disable security measures; this is not limited to whether a WebDriver-controlled browser sends a specific header.

I can see a potential use case where websites want to prevent remote-controlled browsers from crawling them, but giving them a header is comparable to UA string sniffing, which websites already do to gate certain browsers from accessing content. Dedicated users spoof the UA string to circumvent this, both for pleasure and for automation.

Giving a UA under automation a dedicated header would let the UA string remain unchanged, potentially allowing websites to trigger an ‘automation mode’ while still triggering the usual UA sniffing behaviour.

In any case, I’m not sure it is desirable to give websites another mechanism for sniffing the client, potentially fragmenting the web further. On the other hand, because it’s impossible to prevent a user from modifying the UA, an automation header might be useful if only for convenience.

However, the use cases listed here so far are not convincing: if automation tools could avoid CAPTCHAs and ads by sending a header, what would prevent users from modifying their own browsers to do the same? I think the only safe option for triggering ‘test mode’ behaviour on your website is to spin up an instrumented version in a controlled test environment.


#10

Do you happen to know the support timeline for navigator.webdriver? I checked stable Firefox, Chrome, and Safari, and none of them exposes the property yet.

Firefox is the only implementation of W3C WebDriver I know of, but we haven’t implemented the navigator.webdriver fingerprint yet.


#11

This is exactly why we should be focusing on use cases over implementation details. Currently, the only quasi-valid use case is “recommending API usage over scraping”. However, that has numerous shortcomings and only lulls site developers into a false sense of control.

Even without access to the browser binary, extensions in at least Chrome and Firefox are capable of doing this for anything at the network layer, so adding something to modify the headers is quite trivial.