The getUserMedia function is a wonderful tool for capturing audio and video from the user’s device. However, A-level compliance with WCAG 2.0 requires captions or transcripts for the audio.
Currently, creating accessible applications that use getUserMedia requires a secondary entry step in which the user either uploads a WebVTT file or manually enters the text for a transcript at creation time (or creates the captions/transcripts after the fact). This places an extra burden on the application and leads to less development of accessible tools. If it were somehow possible to upgrade getUserMedia to additionally output a best-guess transcript or WebVTT file, it would better facilitate the creation of more accessible applications that can serve the needs of more users.
If I get this right, this would require having universal and reliable speech recognition in browsers, and that is probably beyond what can be reasonably expected today.
There are online services that make best-guess attempts at captioning videos, so the current fallback would be to see if these services offer APIs for that, and have the captured video uploaded there for transcription.
That was the belief that I had originally, but then it occurred to me: why would calling out to the operating system's native speech-to-text functionality be more unreachable than accessing the camera or microphone? One should be able to check for these things programmatically (like we check for access to the microphone or the camera), so we would use them where applicable.
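To make the "check programmatically" idea concrete, here is a minimal sketch of feature detection for both capabilities. It assumes the recognizer is exposed under the browser globals `SpeechRecognition` or `webkitSpeechRecognition`; the `GlobalLike` and `CaptureSupport` types and the `detectCaptureSupport` helper are hypothetical names made up for illustration. Note that detecting a recognizer says nothing about whether it can be fed an arbitrary audio stream.

```typescript
// Minimal structural type so the check can run against `window` in a browser
// or against a plain object elsewhere. GlobalLike is a hypothetical helper type.
interface GlobalLike {
  navigator?: { mediaDevices?: { getUserMedia?: unknown } };
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
}

// Hypothetical result shape for this sketch.
interface CaptureSupport {
  canCapture: boolean;
  canRecognizeSpeech: boolean;
}

// Reports which of the two capabilities appear to be present.
// This only checks that the entry points exist, not that permission
// has been (or will be) granted by the user.
function detectCaptureSupport(g: GlobalLike): CaptureSupport {
  const canCapture =
    typeof g.navigator?.mediaDevices?.getUserMedia === "function";
  const canRecognizeSpeech =
    typeof g.SpeechRecognition === "function" ||
    typeof g.webkitSpeechRecognition === "function";
  return { canCapture, canRecognizeSpeech };
}
```

In a browser one would pass `window` as the argument; an application could then offer automatic captioning only when both flags are true and fall back to manual transcript entry otherwise.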
Would it make more sense to change/augment the SpeechRecognition interface to accept an audio stream instead of just accepting input from the microphone? You could then pipe the video/audio content through to it.
It might be possible to use the platform audio capture and
speech recognition capability.
The other thing that would need to happen is conversion of the
speech recognition output to an appropriate time-stamped format (such as
WebVTT). In the absence of platform-level mechanisms for creating time-stamped
formats, this would presumably need to happen at the browser level?
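As a rough illustration of that conversion step, the sketch below turns a list of timed phrases into WebVTT text. It assumes the recognizer can report start and end offsets in seconds for each phrase, which not every engine does; the `TimedPhrase` shape and the function names are hypothetical, not part of any existing API.

```typescript
// A recognized phrase with offsets in seconds from the start of the recording.
// Hypothetical shape: real engines report timing differently, if at all.
interface TimedPhrase {
  startSec: number;
  endSec: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to at least `width` digits.
function zeroPad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as a WebVTT timestamp (hh:mm:ss.ttt).
function toVttTimestamp(totalSeconds: number): string {
  const totalMs = Math.max(0, Math.round(totalSeconds * 1000));
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  return `${zeroPad(h, 2)}:${zeroPad(m, 2)}:${zeroPad(s, 2)}.${zeroPad(ms, 3)}`;
}

// Build a WebVTT document with one cue per phrase.
// Empty phrases and phrases with non-positive duration are skipped.
function phrasesToWebVtt(phrases: TimedPhrase[]): string {
  const cues = phrases
    .filter((p) => p.text.trim() !== "" && p.endSec > p.startSec)
    .map((p) => {
      // Cue text may not contain blank lines or "-->", so collapse
      // whitespace and escape the characters WebVTT treats specially.
      const text = p.text
        .trim()
        .replace(/\s+/g, " ")
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
      return `${toVttTimestamp(p.startSec)} --> ${toVttTimestamp(p.endSec)}\n${text}`;
    });
  return ["WEBVTT"].concat(cues).join("\n\n") + "\n";
}
```

The resulting string could be wrapped in a Blob and attached to the recording through a `<track>` element. Other caption formats would need their own serializers over the same timed-phrase data.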
Time-stamp formats are another consideration. WebVTT is the
accepted HTML5 format, but is there a use case for producing other caption
formats in this scenario? QT (QuickTime), SAMI/SMI (Windows Media), SCC (iOS,
amongst other things) and others, for example…
I don’t think that’s true. Occasional, unreliable speech recognition is often better than nothing for accessibility.
And it doesn’t have to be in the browser itself. There are a number of speech-to-text engines around in browsers, operating systems, and online. Having a simple way to pipe the audio to one of those and get a text feed back, that gets attached to the audio, seems like a useful thing. Although being able to determine where it gets piped is a privacy and security issue, the mechanics seem pretty straightforward - the question is where in the pipeline any buffering / re-synch should take place.