RFC: Proposal for integration Streams <--> MediaStreamTrack API


#1

Proposal for a (small) API to allow creation of Streams out of a MediaStreamTrack:

https://github.com/yellowdoge/streams-mediastreamtrack (the following text is mostly extracted from the README.md).

Streams are designed to provide real time streams of data with powerful semantics (e.g. built-in backpressure and queuing) to allow users to build higher-level abstractions.

MediaStreamTracks are opaque handles to real-time media being transported in the browser. This media is produced or consumed via Sources and Sinks offered by the platform (and detailed in a number of Specs); however, if a given piece of functionality is not readily available, users' options are limited. The capabilities for audio processing are quite developed thanks to the MediaStreamTrack-WebAudio bridge and its ScriptProcessorNode. For video, however, unsupported source/sink functionality forces users to resort to contortions such as reflection through intermediate HTML elements (e.g. <canvas>, see the Workarounds Section) or offline processing (e.g. using MediaRecorder). These approaches lose the timing information, introduce friction in the interoperability between elements and add unnecessary processing steps. This situation is only made more evident with the arrival of powerful programmable environments such as WebAssembly, where users will naturally expect to be able to manipulate real-time media.

This API reconciles these two existing ways to access media in the browser, enhancing platform ergonomics and orthogonality while attempting to define only the minimum amount of new data types (VideoFrame).

Use cases

Use cases that depend explicitly on timing are enabled, e.g.:

  • Measuring the number of video frames produced; calculating the source frame rate.
  • Calculating inter-frame measures, e.g. motion flow, presence/absence, stabilization.
  • Adding subtitles to video/audio.
  • Manipulating depth data streams.
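The first of these use cases can be sketched against a mock stream. The frame shape (a plain object with a |timecode| field in milliseconds) and the mock timestamps are assumptions for illustration; in the proposal, frames would come from the track's readable side:

```javascript
// Mock standing in for the proposed track readable side: four frames
// whose |timecode| values (in milliseconds) are made up for this sketch.
function makeMockStream() {
  const frames = [0, 33.3, 66.7, 100].map((t) => ({ timecode: t }));
  return new ReadableStream({
    start(controller) {
      for (const frame of frames) controller.enqueue(frame);
      controller.close();
    },
  });
}

// Count frames and derive the source frame rate from the first and
// last observed timecodes.
async function measureFrameRate(stream) {
  const reader = stream.getReader();
  let count = 0;
  let first = 0;
  let last = 0;
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    if (count === 0) first = value.timecode;
    last = value.timecode;
    count += 1;
  }
  // Frames per second over the observed span.
  return ((count - 1) / (last - first)) * 1000;
}

measureFrameRate(makeMockStream()).then((fps) => {
  console.log(fps.toFixed(1)); // 30.0 for the mock timestamps
});
```

Note that this only works because each frame carries its own timing; with the <canvas> workaround below, the consumer never sees the source timestamps at all.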

Use cases that do not depend explicitly on timing are already possible today, but are enhanced:

  • Producing per-frame analysis and transformations, and the endless array of digital image processing algorithms, e.g. edge enhancement or chroma keying.
  • Adjusting the presentation timestamp of the media to speed up or slow down video, hence creating a timelapse or slow-motion effect.
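The retiming use case maps naturally onto a TransformStream that rescales each frame's timecode. The frame objects and the 2x slow-motion factor below are assumptions, not part of the proposal:

```javascript
// A TransformStream that rescales frame timecodes: factor > 1 stretches
// the timeline (slow motion), factor < 1 compresses it (timelapse).
function makeRetimer(factor) {
  return new TransformStream({
    transform(frame, controller) {
      controller.enqueue({ ...frame, timecode: frame.timecode * factor });
    },
  });
}

// Drain a stream into an array, for inspection.
async function collect(stream) {
  const reader = stream.getReader();
  const out = [];
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return out;
    out.push(value);
  }
}

// Mock source standing in for the track's readable side.
const source = new ReadableStream({
  start(controller) {
    for (const t of [0, 40, 80]) controller.enqueue({ timecode: t });
    controller.close();
  },
});

collect(source.pipeThrough(makeRetimer(2))).then((frames) => {
  console.log(frames.map((f) => f.timecode)); // [ 0, 80, 160 ]
});
```
<test>
const assert = require('assert');
const s = new ReadableStream({
  start(c) {
    [0, 40, 80].forEach((t) => c.enqueue({ timecode: t }));
    c.close();
  },
});
collect(s.pipeThrough(makeRetimer(2))).then((frames) => {
  assert.deepStrictEqual(frames.map((f) => f.timecode), [0, 80, 160]);
});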

Possible future use cases

Making a MediaStreamTrack both a ReadableStream and a WritableStream (i.e. a TransformStream) would allow for even more sophisticated use cases where users could ‘plug’ custom elements in MediaStreamTrack-based pipelines, e.g. WebAssembly operations between WebCam capture and Recording, etc.
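A minimal sketch of the pipeline shape this would enable, with mock streams standing in for the capture and recording tracks and a trivial color-inverting transform standing in for, say, a WebAssembly-backed operation:

```javascript
// Stand-in for a user-supplied processing step (e.g. WebAssembly-backed):
// inverts the RGB channels of an RGBA frame, leaving alpha untouched.
function makeInverter() {
  return new TransformStream({
    transform(frame, controller) {
      const data = frame.data.map((v, i) => (i % 4 === 3 ? v : 255 - v));
      controller.enqueue({ ...frame, data });
    },
  });
}

// Pipe |readable| through the inverter into a collecting sink, in the
// way a recording track's writable side might consume frames.
async function runPipeline(readable) {
  const processed = [];
  await readable
    .pipeThrough(makeInverter())
    .pipeTo(new WritableStream({ write(frame) { processed.push(frame); } }));
  return processed;
}

// Mock capture source producing a single 1-pixel RGBA frame.
const capture = new ReadableStream({
  start(controller) {
    controller.enqueue({ data: new Uint8ClampedArray([0, 128, 255, 255]) });
    controller.close();
  },
});

runPipeline(capture).then((frames) => {
  console.log(Array.from(frames[0].data)); // [ 255, 127, 0, 255 ]
});
```

Backpressure comes for free here: if the sink is slow, pipeTo propagates that back to the source, which is exactly the semantics a real-time capture pipeline needs.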

Current Workarounds

The usual hack to access video data is to cast a given MediaStreamTrack onto a <video> element and in turn onto a <canvas> that is subsequently read back – <video> elements provide no drawn event, so it’s up to the user to blit from <video> to <canvas> on a timely basis (see e.g. this article). Moreover, reading from <canvas> usually implies a costly readback from the GPU, plus potential pixel format conversions.

Chrome's Pepper API introduced and supports both MediaStreamVideoTrack and MediaStreamAudioTrack, addressing a situation similar to the one described.

WebAudio's ScriptProcessorNode (or its successor) enables similar use cases for audio.

Potential for misuse

Security-wise, none: the security model of MediaStreamTrack still applies. As with every media-related API, it can be used maliciously to drag down the user’s CPU.

Rough sketch of a proposal

In a nutshell, it’s trivial:

partial interface MediaStreamTrack {
  // |any| should be ReadableStream, but that is not an idl type.
  [CallWith=ScriptState] readonly attribute any readable;
};

which produces ImageData with timecode information. ImageData provides width, height and data in RGBA format, and can be painted onto a <canvas>. The draft API has provisions for the I420 format though, which is more native to real-time video.
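A sketch of how a consumer might read such frames. The |makeMockTrack| helper fakes the proposed attribute; the ImageData-like frame shape (width, height, RGBA data plus timecode) follows the description above:

```javascript
// Mock of a track exposing the proposed |readable| attribute; a real
// implementation would hand out ImageData-like frames from the platform.
function makeMockTrack() {
  return {
    readable: new ReadableStream({
      start(controller) {
        controller.enqueue({
          width: 2,
          height: 1,
          // Two RGBA pixels: red, then green.
          data: new Uint8ClampedArray([255, 0, 0, 255, 0, 255, 0, 255]),
          timecode: 0,
        });
        controller.close();
      },
    }),
  };
}

// Summarize every frame on the stream as "WxH @ Tms".
async function summarizeFrames(readable) {
  const reader = readable.getReader();
  const lines = [];
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done) return lines;
    lines.push(`${frame.width}x${frame.height} @ ${frame.timecode}ms`);
  }
}

summarizeFrames(makeMockTrack().readable).then((lines) => {
  console.log(lines); // [ '2x1 @ 0ms' ]
});
```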

See the example.


#2

LGTM in general; ReadableStream is used as an IDL type e.g. in fetch, so I would use it here as well.

CallWith on the other hand is a Blink-ism :slight_smile: