It might be possible to use the platform's audio capture and
speech recognition capabilities.
The other thing that would need to happen is conversion of the
speech recognition output to an appropriate time-stamped format (such as
WebVTT). In the absence of platform-level mechanisms for creating time-stamped
formats, this would presumably need to happen at the browser level?
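
As a rough illustration of that conversion step, here is a minimal sketch in Python. It assumes the recogniser hands back phrases with start and end times in seconds; the Segment type and the function names are hypothetical and do not correspond to any platform or browser API. It only shows how little is needed to turn timed phrases into WebVTT cues.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognised phrase. Hypothetical shape, not a platform API."""
    start: float  # seconds from the start of the media
    end: float    # seconds from the start of the media
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def clean_cue_text(text: str) -> str:
    """Collapse whitespace and escape characters that are special in cue text."""
    collapsed = " ".join(text.split())  # a blank line would end the cue early
    return (collapsed.replace("&", "&amp;")
                     .replace("<", "&lt;")
                     .replace(">", "&gt;"))  # also prevents a literal "-->"


def to_webvtt(segments: list[Segment]) -> str:
    """Serialise timed segments as the text of a WebVTT file."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}")
        lines.append(clean_cue_text(seg.text))
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)
```

For example, to_webvtt([Segment(0.0, 2.5, "Hello world")]) yields a file with the WEBVTT header followed by a single cue running from 00:00:00.000 to 00:00:02.500. Whether the recogniser actually exposes per-phrase timings is the open question; without them the browser would have to derive the times itself.
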
Time-stamp formats are another consideration. WebVTT is the
accepted HTML5 format, but is there a use case for producing other caption
formats in this scenario? For example, QT (QuickTime), SAMI/SMI (Windows
Media), SCC (iOS, amongst other things) and others.