Voice Assistant web integration


#1

Following a session I helped to run at the Mozilla Festival recently, I would like to start a discussion about how the Web could better evolve to accommodate voice as a first-class method for input and output. Specifically, how web pages could integrate with Voice Assistants.

We have an API and tools for voice output on the Web: the Speech Synthesis API and screen readers, respectively.

We also have an API for voice input on the Web (though not widely supported): the Speech Recognition API.
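For reference, here is a minimal sketch of how those two existing APIs are used today. Note that both are browser-only; the helper functions below feature-detect and simply return `false` where the APIs are unavailable. The function names and the `en-GB` default are my own choices, not anything from a spec:

```javascript
// Speak a phrase with the Speech Synthesis API, if the browser supports it.
// Returns true when speech was requested, false when the API is unavailable.
function speak(text, lang = "en-GB") {
  if (typeof window === "undefined" || !("speechSynthesis" in window)) {
    return false; // e.g. Node.js, or a browser without the API
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = lang;
  window.speechSynthesis.speak(utterance);
  return true;
}

// Listen for a single phrase with the Speech Recognition API (still vendor-
// prefixed in some browsers), passing the best transcript to a callback.
function listenOnce(onResult) {
  const Recognition =
    typeof window !== "undefined" &&
    (window.SpeechRecognition || window.webkitSpeechRecognition);
  if (!Recognition) return false;
  const recognition = new Recognition();
  recognition.onresult = event => onResult(event.results[0][0].transcript);
  recognition.start();
  return true;
}
```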

However, we do not have the ability to integrate open web content and services with ‘smart’ Voice Assistants (I’ll shorten to “VAs” from now on) - the likes of Alexa, Bixby, Cortana, Google Assistant, Siri and any others that could arrive on the scene in the future. (Mozilla are also working to collect open voice data).

An example use case (courtesy of @patrick_h_lauke) is that on a travel website, as a potential alternative to using mouse/keyboard input on date fields etc., you could use your favourite VA to say something like: “find flights from Manchester to Frankfurt for Wednesday next week, coming back the following day”.

I can imagine various options/possibilities for better voice integration for websites:

  1. Integration with existing APIs could be provided through configuring a browser to use a particular VA, for example to provide the voice for the Speech Synthesis API. Perhaps they could also be enabled as input providers for the Speech Recognition API, although the UX of this would require further thought.

  2. VAs specifically enabled/integrated in users’ browsers could be granted access to some or all website page data, when viewed. They could potentially be limited to areas of the page defined as “speakable” via Schema.org’s vocabulary for speakable content (thanks to Léonie Watson for mentioning this specification to me).

a) Even without any further work from developers, this could allow VAs to read aloud page contents.

b) Extra markup/metadata could also be included by web page authors to help provide hints to VAs, for example additional keywords for input fields. Using this information, complex interactions could become possible, with the onus on the VA to work out what elements are on the page and how to interact with them (e.g. form fields).

  3. A website could advertise a particular voice service as being available on that page. Using a particular schema, defined on the page itself (or separately in open registries?), the website could define the available voice commands and their actions. Mozilla have started to explore this kind of approach under the name “Voice HTML”.

  4. A new API could be provided via JavaScript to give developers hooks into commands intended for the website, via their VA, when viewing the current page. This would require careful privacy design, to ensure that websites could not capture any unintentional voice data.
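As a concrete illustration of option 2, Schema.org’s existing `speakable` property already lets a page declare which regions are suitable for text-to-speech, for example via JSON-LD (the CSS selectors below are hypothetical):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "name": "Flights from Manchester",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".flight-summary"]
  }
}
</script>
```

A VA granted page access could then restrict itself to reading aloud only the elements matched by those selectors.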

VAs are likely to become increasingly important for accessibility, especially for users who are unable or less able to use their hands. And as VAs become smarter, they will power more of our interactions with technology. I suspect that if we do not consider this from an open-standards perspective, we may start to see browsers introducing custom, proprietary integrations with particular VAs. This could limit users — for example, to only being able to use Company X’s VA with that company’s browser. I hope that we can design a more open alternative.

I realise these initial thoughts are very high-level, and I recognise that I am an expert in neither browser engines nor VAs. However, I think this could be an exciting area, and I would like to help spur conversations towards a better future for the voice-powered Web. I welcome any thoughts and comments!


#2

At its core, telling a web application something should require just:

  • an event signalling that the user is about to speak;
  • the stream of voice input (this could be streaming text, or the actual audio stream);
  • optionally, a signal that the user has stopped speaking.

From there, the web application should do its own processing and decomposition of the input into commands. The application has the domain knowledge to decipher the user’s intent. Thus, if folks want to go down the “Voice HTML” route, they can do that freely. Alternatively, different folks can use different processing engines to derive meaning (e.g. sending it into WASM and doing some fancy-pants AI magic to work out what the person is talking about and what they want to do).
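Taking the flight-search utterance from post #1, in-page decomposition could look something like this hypothetical helper, where the application maps a transcript onto its own domain model (a real application would use a proper grammar or an ML model rather than a single regular expression):

```javascript
// Hypothetical sketch: turn a free-text transcript into a structured command.
// Returns null when the transcript does not match any known intent.
function parseFlightCommand(transcript) {
  const match = transcript
    .toLowerCase()
    .match(/find flights from (\w+) to (\w+)/);
  if (!match) return null;
  return { intent: "searchFlights", from: match[1], to: match[2] };
}

// e.g. parseFlightCommand("Find flights from Manchester to Frankfurt for Wednesday")
// yields { intent: "searchFlights", from: "manchester", to: "frankfurt" }
```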

Thus, roughly:

```js
// Hypothetical "voicecommand" event: e.stream is assumed to be a ReadableStream
// of voice input, and processVoiceInputRealTime a TransformStream that turns
// it into commands.
window.addEventListener("voicecommand", e => {
  e.stream
    .pipeThrough(processVoiceInputRealTime)
    .pipeTo(new WritableStream({ write: command => doThis(command) }));
});
```