Per advice from @annevk, posting this here as a problem-oriented discussion, rather than the solution-oriented proposal I made on the whatwg/html repository.
Problem
Getting ImageData for a frame of a video MediaStreamTrack, so that it can be processed in JavaScript, is inefficient: highly inefficient in terms of memory usage, and moderately inefficient in terms of CPU usage.
If you try to process every frame of a video, the ImageData garbage (frames you have already processed) piles up quickly: at 1080p@30 FPS that is roughly 240 MB/sec of discarded frames (1920 × 1080 pixels × 4 bytes per pixel ≈ 8 MB per frame). Depending on the GC implementation and the power of the device, this can render the page unusable. Firefox on a Raspberry Pi 3, for example, will eat through all 1 GB of system memory in a few seconds.
CPU efficiency is also a problem. The current idiom for getting ImageData goes through a canvas, and canvases are increasingly GPU-backed for better performance. For frames captured from a local webcam, that means a CPU-to-GPU-to-CPU round trip, which is unnecessarily inefficient.
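For concreteness, the idiom in question looks roughly like this (`processFrame` stands in for whatever JS/WASM consumer you have):

```js
// Current idiom: draw the <video> element into a canvas, then read the
// pixels back out as ImageData.
const video = document.querySelector('video'); // playing a MediaStream
const canvas = document.createElement('canvas');
canvas.width = video.videoWidth;
canvas.height = video.videoHeight;
const ctx = canvas.getContext('2d');

function grabFrame() {
  ctx.drawImage(video, 0, 0);                // potential CPU -> GPU upload
  const frame = ctx.getImageData(0, 0, canvas.width, canvas.height); // GPU -> CPU readback
  processFrame(frame);                       // hand off to JS/WASM
  requestAnimationFrame(grabFrame);          // a fresh ~8 MB ImageData becomes garbage every frame
}
requestAnimationFrame(grabFrame);
```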
An earlier post here (RFC: Proposal for integration Streams <--> MediaStreamTrack API) by @miguelao proposed a solution that addressed both of these inefficiencies: it avoided using a canvas as an intermediary, and, since it added a ReadableStream to MediaStreamTrack, it had the potential to alleviate the memory problem by enabling ReadableStreamBYOBReader (bring-your-own-buffer) readers. A rough illustration of the BYOB idea follows below.
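The exact API surface of that proposal isn't what matters here; the point is the memory behaviour a BYOB reader would allow. As a purely illustrative sketch, assuming a hypothetical `track.readable` that vends raw RGBA bytes and fills one whole frame per read (not a real API today), the same buffer could be recycled for every frame instead of allocating a new ImageData each time:

```js
// Hypothetical: assumes a MediaStreamTrack exposes its frames as a readable
// byte stream at `track.readable` — this does not exist in any browser.
const track = stream.getVideoTracks()[0];
const { width, height } = track.getSettings();
const reader = track.readable.getReader({ mode: 'byob' });

// One buffer, reused for every frame — no per-frame garbage.
let buffer = new ArrayBuffer(width * height * 4);

async function readFrames() {
  while (true) {
    const { value, done } = await reader.read(new Uint8Array(buffer));
    if (done) break;
    processFrame(value, width, height); // value is a view over our own buffer
    buffer = value.buffer;              // take the (transferred) buffer back for the next read
  }
}
readFrames();
```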
Two years on, that proposal appears to have stalled, and the overall problem it outlined still exists.
Use Cases
- Barcode reading
- Face tracking
- Machine learning
- Augmented reality
With WASM there’s potential to quickly bring existing libraries that do these things into the JS world; all you’d need to do is feed them the video data. The inability to get that data efficiently greatly hinders this, especially on lower-spec devices.
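To make "feed them the video data" concrete, here is a minimal sketch of handing a frame to a WASM library by copying it into the module's linear memory. `detector.wasm`, `alloc`, and `detect_barcodes` are hypothetical names standing in for whatever compiled library you use, not any specific project:

```js
// In a module script (top-level await).
const { instance } = await WebAssembly.instantiateStreaming(fetch('detector.wasm'));
const wasm = instance.exports; // assumes the module exports its memory and an allocator

function detect(frame /* ImageData */) {
  const byteLength = frame.data.byteLength;
  const ptr = wasm.alloc(byteLength); // reserve space in WASM linear memory (hypothetical export)
  new Uint8Array(wasm.memory.buffer, ptr, byteLength).set(frame.data); // copy the pixels in
  return wasm.detect_barcodes(ptr, frame.width, frame.height); // hypothetical export
}
```

Note that this copy into WASM memory is paid on top of whatever it cost to produce the ImageData in the first place, which is exactly why the extraction step needs to be cheap.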