A partial archive of discourse.wicg.io as of Saturday February 24, 2024.

Efficiently Get ImageData from a Video MediaStreamTrack


Per advice from @annevk, posting this here as a problem-oriented discussion, rather than the solution-oriented proposal I made on the whatwg/html repository.


Getting ImageData for a frame from a video MediaStreamTrack, so that it may be processed in JavaScript, is inefficient. It is highly inefficient in terms of memory usage, and moderately inefficient in terms of CPU usage.

If trying to process all frames in a video, the ImageData garbage (frames you’ve already processed) can quickly become large: for 1080p at 30 fps, that’s roughly 240 MB/sec of discarded frames. Depending on the GC implementation and the power of the device, this can render the device unusable. Firefox on a Raspberry Pi 3, for example, will eat through all 1 GB of system memory in a few seconds.
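As a sanity check on that figure, here is the arithmetic, assuming each frame is a full-size RGBA ImageData (4 bytes per pixel):

```javascript
// Each 1080p ImageData holds 4 bytes per pixel (RGBA).
const bytesPerFrame = 1920 * 1080 * 4;     // 8,294,400 bytes ≈ 8.3 MB per frame
const bytesPerSecond = bytesPerFrame * 30; // 248,832,000 bytes at 30 fps

// ≈ 248.8 MB/sec (≈ 237 MiB/sec), i.e. the "240 MB/sec" cited above --
// all of it garbage once each frame has been processed.
```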

CPU efficiency is also a problem. The current idiom for getting ImageData goes through a canvas, and canvases are increasingly GPU-backed for better performance. For capturing video frames from a local webcam, this means a CPU-to-GPU-to-CPU round trip, which is needlessly inefficient.
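For reference, the canvas idiom in question looks roughly like this (a sketch; `video` is assumed to be a playing &lt;video&gt; element fed by the MediaStreamTrack):

```javascript
// The common (inefficient) idiom: draw the current video frame to a
// canvas, then read the pixels back with getImageData(). Every call
// allocates a fresh ImageData, which becomes garbage once processed,
// and on a GPU-backed canvas getImageData() forces a GPU-to-CPU readback.
function grabFrame(video, canvas) {
  const ctx = canvas.getContext('2d');
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  return ctx.getImageData(0, 0, canvas.width, canvas.height);
}
```

Calling `grabFrame()` once per frame is exactly what produces the hundreds of MB/sec of garbage described above.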

An earlier post here (RFC: Proposal for integration Streams <--> MediaStreamTrack API) by @miguelao proposed a solution that dealt with both of these inefficiencies: it avoided using the canvas as an intermediary, and, since it added a ReadableStream to MediaStreamTrack, it had the potential to alleviate the memory problem by enabling ReadableStreamBYOBReader (bring-your-own-buffer) reads.
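For illustration only, a hypothetical sketch of what BYOB reading from a track might have looked like under that proposal — `track.readable`, the byte-stream support, and `processFrame` are all assumptions here, not a shipped API:

```javascript
// Hypothetical: assumes the proposal's ReadableStream on MediaStreamTrack,
// exposed as `track.readable`, supported BYOB (byte-stream) reads.
const reader = track.readable.getReader({ mode: 'byob' });

// One reusable buffer sized for a 1080p RGBA frame -- no per-frame garbage.
let buffer = new ArrayBuffer(1920 * 1080 * 4);

while (true) {
  const { value, done } = await reader.read(new Uint8Array(buffer));
  if (done) break;
  processFrame(value);   // e.g. hand the pixels to a WASM library
  buffer = value.buffer; // reclaim the (transferred) buffer for the next read
}
```

The point of the pattern is that the same allocation is cycled through every read, so frame delivery creates no GC pressure at all.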

Two years on, that proposal appears to have stalled out. The overall problem outlined above still exists.

Use Cases

  • Barcode reading
  • Face tracking
  • Machine learning
  • Augmented reality

With WASM there’s potential to quickly bring existing libraries that do these things into the JS world; all you’d need to do is feed them the video data. The inability to get this data efficiently greatly hinders that, especially on lower-spec devices.


There was a proposal from Mozilla, discussed a fair bit at TAG and elsewhere, to allow efficient off-main-thread and GPU/video-memory-based interaction with video frames. Even more important than the memory considerations is getting the data processing off the main thread. The person responsible for the proposal is no longer with Mozilla, but the proposal is still around somewhere, and it directly addresses this set of use cases (they were the driving force behind it, along with other related ones).

Jan-Ivar or Nils can likely help find the proposal and perhaps recap the status. I’ll point them here.


/cc @mounir


@miguelao and @Dale_Curtis are the right contacts on the Google side.


@jesup is referring to FoxEye, an old experiment at mozilla that resulted in the (since abandoned) https://w3c.github.io/mediacapture-worker/ proposal. More recent strawman proposals like https://alvestrand.github.io/audio-worklet/ were briefly discussed at TPAC 2018, but the WebRTC WG decided to focus on finishing 1.0 at that time, and to wait and see how audio worklets pan out. TPAC 2019 might be an opportunity to revisit this problem, though there’s some question whether it belongs in the WebRTC or Media WG.

Taking a step back and discussing the problem here SGTM. I hear:

  1. Avoid GC/CC-triggering buffer usage
  2. Move CPU usage off main-thread
  3. Limit needless copying (CPU vs GPU) by aligning with browsers’ media stacks where possible

I’d add: 4) Clarify whether the need is read-only (semi-realtime) vs read-write (realtime).

E.g. a web worker API might satisfy things like face tracking. But if people expect to read & modify video track data in real-time before playout (or before producing a MediaStreamTrack), then something that can keep up with audio worklets might be the better fit.

We don’t have a lot of experience with workers and media. Worklets are severely locked down to keep GC/CC in check, which sounds good, but it might be quite limiting and hard to get data off its specialized thread. Video is frame-based, so maybe timing requirements are looser? OTOH, video data dwarfs audio. This is a tricky area to guess about, at least for me.

Another problem is where to put this. The web platform breaks down workers based on top-down problem domains:

  • web workers = page CPU offloading
  • service workers = network caching
  • worklets = render pipelining

Yet we’re discussing this as a bottom-up optimization, which doesn’t inform much about where to put this.


Similar concept: Proposal: videoWorklet [worklets-1] #905