[Proposal] Allow Media Source Extensions to support demuxed and raw frames


JSON/RAW MediaSource Byte Streams Explainer


The world of audio and video codecs and container formats is diverse. For reasons of security, licensing, size restrictions, and more, browsers can’t intrinsically support all codecs and formats for audio / video playback. However, that doesn’t mean we can’t offer a way for developers to plug decoders and/or demuxers written in JavaScript or WebAssembly into our media pipelines.

We propose a couple of additions to the Media Source Extensions byte stream formats that would allow injection of demuxed or raw audio and video frames.

Audio Proposal

For audio, we propose adding support for ‘audio/wav’ to handle raw audio. For non-raw audio data we already have ‘audio/aac’ and ‘audio/mp3’.

Video Proposal

For video, we propose adding support for ‘video/raw’ which would use a simple 24-byte header based on IVF to encapsulate both raw and encoded samples.

bytes 0-3    FourCC (e.g., 'VP80', 'I420', 'AV01', etc)
bytes 4-5    visible width in pixels
bytes 6-7    visible height in pixels
bytes 8-15   64-bit presentation timestamp
byte  16     color space primary id.
byte  17     color space transfer id.
byte  18     color space matrix id.
byte  19     color space full range flag.
bytes 20-23  size of frame in bytes (not including the header)

Sample type would be determined by the FourCC embedded in the header; e.g., I420 for planar YUV 4:2:0, P010 for semi-planar 10-bit YUV 4:2:0, etc. For encoded data, we would use ‘AV01’, ‘VP80’, ‘VP90’, and ‘H264’.

Color space values will come from the AV1 codec specification.
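
To make the layout concrete, here is a minimal sketch that packs and unpacks the 24 bytes listed above (offsets 0 through 23). Two things are assumptions on my part, not part of the proposal: multi-byte fields are written little-endian (as IVF does), and the function and field names (writeRawVideoHeader, readRawVideoHeader, primaries, transfer, matrix, fullRange) are made up for illustration. The timestamp unit is left open, since the proposal doesn't define one.

```javascript
// Header size implied by the byte layout above (offsets 0..23).
const HEADER_SIZE = 24;

// ASSUMPTION: little-endian byte order, borrowed from IVF.
const LITTLE_ENDIAN = true;

// Packs the header fields into a 24-byte Uint8Array.
function writeRawVideoHeader(fields) {
  if (fields.fourcc.length !== 4) {
    throw new RangeError('FourCC must be exactly 4 characters');
  }
  const view = new DataView(new ArrayBuffer(HEADER_SIZE));
  for (let i = 0; i < 4; i++) {
    view.setUint8(i, fields.fourcc.charCodeAt(i));          // bytes 0-3
  }
  view.setUint16(4, fields.width, LITTLE_ENDIAN);            // bytes 4-5
  view.setUint16(6, fields.height, LITTLE_ENDIAN);           // bytes 6-7
  view.setBigUint64(8, BigInt(fields.timestamp), LITTLE_ENDIAN); // bytes 8-15
  view.setUint8(16, fields.primaries);                       // byte 16
  view.setUint8(17, fields.transfer);                        // byte 17
  view.setUint8(18, fields.matrix);                          // byte 18
  view.setUint8(19, fields.fullRange ? 1 : 0);               // byte 19
  view.setUint32(20, fields.frameSize, LITTLE_ENDIAN);       // bytes 20-23
  return new Uint8Array(view.buffer);
}

// Reads the header back; throws RangeError if fewer than 24 bytes are given.
function readRawVideoHeader(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, HEADER_SIZE);
  let fourcc = '';
  for (let i = 0; i < 4; i++) {
    fourcc += String.fromCharCode(view.getUint8(i));
  }
  return {
    fourcc,
    width: view.getUint16(4, LITTLE_ENDIAN),
    height: view.getUint16(6, LITTLE_ENDIAN),
    timestamp: view.getBigUint64(8, LITTLE_ENDIAN), // BigInt
    primaries: view.getUint8(16),
    transfer: view.getUint8(17),
    matrix: view.getUint8(18),
    fullRange: view.getUint8(19) !== 0,
    frameSize: view.getUint32(20, LITTLE_ENDIAN),
  };
}
```

A frame appended to the SourceBuffer would then be this header followed by frameSize bytes of sample data.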

Use cases

  • Current adaptive streaming libraries are transmuxing from MPEG2-TS to ISOBMFF / MP4 to handle HLS playback using Media Source Extensions. This would allow them to skip the remuxing step.
  • Experimental codecs could be shipped as downloadable WebAssembly/JS packages, letting individual sites iterate and experiment more quickly.
  • Developers could add support in WebAssembly/JS for obscure or otherwise unsupported codecs; e.g., one could imagine a web-based version of ffmpeg’s ffplay tool offering support for all of its codecs through this API.

Proposed API / Example

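  let video = document.createElement('video');
  let mse = new MediaSource();

  mse.addEventListener('sourceopen', function() {
    // Named *Buffer so they don't shadow the <video> element above.
    let audioBuffer = mse.addSourceBuffer('audio/wav');
    let videoBuffer = mse.addSourceBuffer('video/raw');

    videoBuffer.addEventListener('updateend', function() {
      // Intentionally left empty in this sketch.
    }, {once: true});

    audioBuffer.addEventListener('updateend', function() {
      // firstVideoPresentationTimestamp is a placeholder; it is not
      // defined anywhere in this example.
      audioBuffer.timestampOffset = firstVideoPresentationTimestamp;
    }, {once: true});
  }, {once: true});

  video.src = window.URL.createObjectURL(mse);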

Open Questions / Notes / Links

  • Is it folly to use fourcc codes for pixel formats? These don’t seem standardized. ffmpeg has one definition and Microsoft another definition.
  • Unfortunately, Opus and Vorbis don’t have specified MPEG1 packetizations – only Ogg, MP4, and WebM. So if developers want to use demuxed Opus or Vorbis, it will need to be in an ISO-BMFF or WebM container unless we add an Ogg demuxer to MSE.
  • Do we need stride data for raw formats? Requiring all data to be packed into the visible range may have performance implications. Just having a coded size field and formula for how that maps to plane sizes probably solves most issues.
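
As a rough illustration of the "coded size plus formula" idea in the last bullet, here is what the mapping could look like for I420. This is only a sketch under assumptions the proposal doesn't state: 8-bit samples, rows packed tightly at the coded width, and chroma dimensions rounded up for odd sizes. The name i420PlaneSizes is hypothetical.

```javascript
// Maps a coded size to I420 plane sizes in bytes.
// ASSUMPTIONS: 8-bit samples, no row padding beyond the coded width,
// chroma planes are half resolution in each dimension, rounded up.
function i420PlaneSizes(codedWidth, codedHeight) {
  const chromaWidth = Math.ceil(codedWidth / 2);
  const chromaHeight = Math.ceil(codedHeight / 2);
  const y = codedWidth * codedHeight;
  const u = chromaWidth * chromaHeight;
  return {
    y,                      // luma plane size
    u,                      // first chroma plane size
    v: u,                   // second chroma plane size (same as u)
    total: y + 2 * u,       // size of one packed frame
    yStride: codedWidth,    // bytes per luma row
    uvStride: chromaWidth,  // bytes per chroma row
  };
}
```

With a coded size of 1920x1088 and a visible size of 1920x1080, the frame payload would be 3133440 bytes, and the consumer would crop to the visible rows without needing a separate stride field.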

Hi, thanks for your proposal!

This would be really useful for my ogv.js project which we use at Wikipedia to play WebM and Ogg media in Safari, as well as potential use cases I’ve got in mind for real-time video processing such as frame-accurate compositing in an on-web video editor.

A couple notes:

  • FourCCs are indeed not very standardized. An alternative is to use the codec parameter values from MIME types, e.g. ‘opus’, ‘vp8’, ‘avc1.42E01E’ etc., but these are inconveniently variable-length.
  • I would strongly recommend making a normative list of must-be-supported raw pixel formats, and a way to test for support. (This could be done with MediaSource.isTypeSupported('video/raw; codecs="I422"') maybe?)
  • Header data is an issue for some codecs, as noted, though when transcoding to raw data (my use case) that won’t be an issue.
  • For raw video, a stride would be nice as it can avoid an extra copy… but only if the planar data is arranged correctly to begin with, which I think it often would not be, because the number of visible pixel rows is not always a multiple of the macroblock size. Which leads us to having a stride and a frame height?
  • More importantly, having both display area and visible area would be really nice for raw video, as videos don’t always come in square pixels; when we’re transcoding from media sourced from old MPEG-2 files this often comes up.
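
For the support-testing idea above, a page could probe pixel formats roughly like this. It's a sketch only: 'video/raw' and FourCC values in the codecs parameter are what this thread proposes, not something any browser is known to accept, and supportedRawFormats is a made-up helper name. The second parameter exists so the function can be exercised outside a browser.

```javascript
// Returns the subset of candidate FourCCs that the given MediaSource
// implementation reports as supported for the proposed 'video/raw' type.
// ASSUMPTION: the FourCC is carried in the codecs parameter, as suggested above.
function supportedRawFormats(candidates, mediaSource = globalThis.MediaSource) {
  if (!mediaSource || typeof mediaSource.isTypeSupported !== 'function') {
    return []; // No MSE available at all.
  }
  return candidates.filter((fourcc) =>
    mediaSource.isTypeSupported(`video/raw; codecs="${fourcc}"`));
}
```

A decoder written in WebAssembly could call supportedRawFormats(['I420', 'I422', 'P010']) once at startup and pick its output format from the result, falling back to a canvas-based path when the list is empty.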

  • We could definitely require something like I420 if other browsers are on board.
  • I think just having coded size would resolve your stride issues. It would simply mean the decoder allocates a bit more when decoding planar formats. Decoders would need to be changed anyway to make a single allocation that they split into planes.
  • Seems like we might need coded size, visible size, and natural size, just like we use internally then. Though, when providing raw frames, it seems you could just set the visible size dependent on your pixel size? So long as we have a coded size that is.

(Repeating your comment about composition with WebCodecs): Yes I had that thought as well. If we end up with standardized definitions for an EncodedPacket and DecodedFrame we could add append methods for arrays of those to MediaSource SourceBuffer objects.

  • It’s possible that has a longer standardization process, since it requires new APIs rather than an extension to the byte stream registry. It’s also unclear exactly how a wasm decoder would be able to write directly into a JS object. Possibly the object can use ArrayBufferViews that point into wasm memory.

I believe the information included here isn’t sufficient to be able to handle all codecs.

Here is the information that we end up needing to decode a raw frame in Gecko.

bool KeyFrame
bool EOS (used for Vorbis content to determine if trimming is needed on the decoded output)
uint32_t discardPadding (used to indicate that audio is to be trimmed at the front)
uint64_t presentationTimestamp
uint64_t decodeTimestamp (required for MSE)
CryptoSample cryptoData
ExtraData() -> required for H264 content with out of band SPS/PPS
TrackInfo() : either AudioInfo / VideoInfo; this could be provided in something like an init segment.

For video, you need not just the visible size but also the picture size. Picture size vs. display size lets you determine the aspect ratio.


ImageSize ( include cropping info)
Rotation // describe how many degrees the decoded image must be rotated
ColorRange // full / limited
CodecSpecificConfig (array of bytes)


Bit depth
CodecSpecificConfig (array of bytes)

That’s just off the top of my head. There may be more. But it indicates that the proposed format doesn’t cater for all content.