[Proposal] Allow Media Source Extensions to support demuxed and raw frames

https://github.com/dalecurtis/raw-mse/blob/master/explainer.md

JSON/RAW MediaSource Byte Streams Explainer

Introduction

The world of audio and video codecs and container formats is diverse. For reasons of security, licensing, size restrictions, and more, browsers can’t intrinsically support all codecs and formats for audio / video playback. However, that doesn’t mean we can’t offer a way for developers to plug decoders and/or demuxers written in JavaScript or WebAssembly into our media pipelines.

We propose a couple of Media Source Extensions byte stream format additions which will allow injection of demuxed or raw audio and video frames.

Audio Proposal

For audio, we propose adding support for ‘audio/wav’ to handle raw audio. For non-raw audio data we already have ‘audio/aac’ and ‘audio/mp3’.

Video Proposal

For video, we propose adding support for ‘video/raw’ which would use a simple 24-byte header based on IVF to encapsulate both raw and encoded samples.

bytes 0-3    FourCC (e.g., 'VP80', 'I420', 'AV01', etc)
bytes 4-5    visible width in pixels
bytes 6-7    visible height in pixels
bytes 8-15   64-bit presentation timestamp
byte  16     color space primary id.
byte  17     color space transfer id.
byte  18     color space matrix id.
byte  19     color space full range flag.
bytes 20-23  size of frame in bytes (not including the header)

Sample type would be determined by the fourcc embedded in the header; e.g., I420 for planar YUV 4:2:0, P010 for planar 10bit YUV 4:2:0, etc. For encoded data, we would use ‘AV01’, ‘VP80’, ‘VP90’, and ‘H264’.

Color space values will come from the AV1 codec specification.
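
As a rough illustration of the layout above (bytes 0-23, so 24 bytes in total), here is a minimal sketch of packing and parsing the header with a DataView. The proposal does not state a byte order, whether the timestamp is signed, or its unit, so little-endian (as in IVF) and an unsigned timestamp are assumptions here. The function names and field names are hypothetical, not part of any API.

```javascript
// Sketch only. Assumptions not stated in the proposal:
//  - multi-byte fields are little-endian (IVF convention)
//  - the 64-bit presentation timestamp is unsigned
const HEADER_SIZE = 24; // bytes 0-23

// Hypothetical helper: packs the header fields into a Uint8Array.
function writeRawVideoHeader(fields) {
  const buffer = new ArrayBuffer(HEADER_SIZE);
  const view = new DataView(buffer);
  for (let i = 0; i < 4; i++) {
    view.setUint8(i, fields.fourcc.charCodeAt(i));       // bytes 0-3
  }
  view.setUint16(4, fields.visibleWidth, true);          // bytes 4-5
  view.setUint16(6, fields.visibleHeight, true);         // bytes 6-7
  view.setBigUint64(8, BigInt(fields.timestamp), true);  // bytes 8-15
  view.setUint8(16, fields.primaries);                   // byte 16
  view.setUint8(17, fields.transfer);                    // byte 17
  view.setUint8(18, fields.matrix);                      // byte 18
  view.setUint8(19, fields.fullRange ? 1 : 0);           // byte 19
  view.setUint32(20, fields.frameSize, true);            // bytes 20-23
  return new Uint8Array(buffer);
}

// Hypothetical helper: parses a header from the start of a Uint8Array.
// Throws a RangeError if fewer than 24 bytes are available.
function readRawVideoHeader(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, HEADER_SIZE);
  return {
    fourcc: String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3]),
    visibleWidth: view.getUint16(4, true),
    visibleHeight: view.getUint16(6, true),
    timestamp: view.getBigUint64(8, true), // BigInt
    primaries: view.getUint8(16),
    transfer: view.getUint8(17),
    matrix: view.getUint8(18),
    fullRange: view.getUint8(19) !== 0,
    frameSize: view.getUint32(20, true),
  };
}

// Example: a 1920x1080 I420 frame with BT.709 color
// (value 1 for primaries, transfer, and matrix in the AV1 specification).
const exampleHeader = writeRawVideoHeader({
  fourcc: 'I420',
  visibleWidth: 1920,
  visibleHeight: 1080,
  timestamp: 0,
  primaries: 1,
  transfer: 1,
  matrix: 1,
  fullRange: false,
  frameSize: 1920 * 1080 * 3 / 2,
});
```

Note that 16-bit width and height fields cap the visible size at 65535 pixels per dimension, and the 32-bit size field caps a frame at just under 4 GiB.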

Use cases

  • Current adaptive streaming libraries are transmuxing from MPEG2-TS to ISOBMFF / MP4 to handle HLS playback using Media Source Extensions. This would allow them to skip the remuxing step.
  • Experimental codecs could be launched as downloadable WebAssembly/JS packages that individual sites can iterate on and experiment with more quickly.
  • Developers could add support in WebAssembly/JS for obscure or otherwise unsupported codecs; e.g., one could imagine a web-based version of ffmpeg’s ffplay tool offering support for all of its codecs through this API.

Proposed API / Example

  let video = document.createElement('video');
  let mse = new MediaSource();

  mse.addEventListener('sourceopen', function() {
    let audio = mse.addSourceBuffer('audio/wav');
    let video = mse.addSourceBuffer('video/raw');

    video.addEventListener('updateend', function() {
      video.appendBuffer(videoData);
    }, {once: true});

    audio.addEventListener('updateend', function() {
      audio.timestampOffset = firstVideoPresentationTimestamp;
      audio.appendBuffer(pcmAudioData);
    }, {once: true});
  }, {once: true});

  video.src = window.URL.createObjectURL(mse);

Open Questions / Notes / Links

  • Is it folly to use fourcc codes for pixel formats? These don’t seem standardized. ffmpeg has one definition and Microsoft another definition.
  • Unfortunately Opus and Vorbis don’t have specified MPEG1 packetizations – only Ogg, MP4, and WebM. So if developers want to use demuxed Opus or Vorbis, it will need to be in an ISO-BMFF or WebM container unless we add an Ogg demuxer to MSE.
  • Do we need stride data for raw formats? Requiring all data to be packed into the visible range may have performance implications. Just having a coded size field and formula for how that maps to plane sizes probably solves most issues.
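
To make the last point concrete, here is a minimal sketch of one possible "coded size to plane sizes" formula for I420. It assumes 8-bit samples, rows packed at the coded width (so stride equals coded width), and chroma dimensions rounded up for odd sizes. None of this is specified by the proposal, and the function name is hypothetical.

```javascript
// Sketch only: I420 (planar YUV 4:2:0), 8 bits per sample, with each
// plane tightly packed at the coded size. Rounding up the chroma
// dimensions for odd coded sizes is an assumption.
function i420PlaneLayout(codedWidth, codedHeight) {
  const chromaWidth = Math.ceil(codedWidth / 2);
  const chromaHeight = Math.ceil(codedHeight / 2);
  const ySize = codedWidth * codedHeight;
  const chromaSize = chromaWidth * chromaHeight;
  return {
    y: { offset: 0, stride: codedWidth, size: ySize },
    u: { offset: ySize, stride: chromaWidth, size: chromaSize },
    v: { offset: ySize + chromaSize, stride: chromaWidth, size: chromaSize },
    totalSize: ySize + 2 * chromaSize,
  };
}

// A 1920x1080 visible frame decoded with 16-pixel macroblocks would
// typically have a coded size of 1920x1088.
const layout = i420PlaneLayout(1920, 1088);
console.log(layout.totalSize); // 3133440
```

With a coded size in the header, a decoder could write directly into a single allocation of totalSize bytes and the visible size would then only describe the crop.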

Hi, thanks for your proposal!

This would be really useful for my ogv.js project which we use at Wikipedia to play WebM and Ogg media in Safari, as well as potential use cases I’ve got in mind for real-time video processing such as frame-accurate compositing in an on-web video editor.

A couple notes:

  • FourCCs are indeed not very standardized. :frowning: An alternative is to use the codec parameter values from MIME types, e.g. ‘opus’, ‘vp8’, ‘avc1.42E01E’, etc., but these are inconveniently variable-length.
  • I would strongly recommend making a normative list of must-be-supported raw pixel formats, and a way to test for support. (This could be done with MediaSource.isTypeSupported('video/raw; codecs="I422"') maybe?)
  • Header data is an issue for some codecs, as noted, though when transcoding to raw data (my use case) that won’t be an issue.
  • For raw video, a stride would be nice as it can avoid an extra copy… but only if the planar data is arranged correctly to begin with, which I think it often would not be, because the number of visible pixel rows is not always a multiple of the macroblock size. Which leads us to having a stride and a frame height?
  • More importantly, having both display area and visible area would be really nice for raw video, as videos don’t always come in square pixels; when we’re transcoding from media sourced from old MPEG-2 files this often comes up.
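
A minimal sketch of what the support test suggested above might look like. MediaSource.isTypeSupported() is an existing static method, but the 'video/raw' type and FourCC values in the codecs parameter are only proposed in this thread, so a current browser would be expected to report them as unsupported. The helper name is hypothetical, and the check is passed in as a function so the sketch can also run outside a browser.

```javascript
// Sketch only: 'video/raw; codecs="<FourCC>"' is the hypothetical type
// string floated in this thread, not a registered byte stream format.
function supportedRawFormats(fourccs, isTypeSupported) {
  return fourccs.filter((fourcc) =>
    isTypeSupported(`video/raw; codecs="${fourcc}"`));
}

// In a browser, the real check could be wired in like this.
if (typeof MediaSource !== 'undefined') {
  const candidates = ['I420', 'I422', 'P010'];
  const usable = supportedRawFormats(
    candidates, (type) => MediaSource.isTypeSupported(type));
  console.log('Supported raw formats:', usable);
}
```

A normative must-support list would let pages skip this probing for the baseline formats and only test for the optional ones.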

Thanks!

  • We could definitely require something like I420 if other browsers are on board.
  • I think just having coded size would resolve your stride issues. It would simply mean the decoder allocates a bit more when decoding planar formats. Decoders would need to be changed to make a single allocation that they break up into planes anyway.
  • Seems like we might need coded size, visible size, and natural size, just like we use internally then. Though, when providing raw frames, it seems you could just set the visible size dependent on your pixel size? So long as we have a coded size that is.

(Repeating your comment about composition with WebCodecs): Yes I had that thought as well. If we end up with standardized definitions for an EncodedPacket and DecodedFrame we could add append methods for arrays of those to MediaSource SourceBuffer objects.

  • It’s possible that has a longer standardization process since it requires new APIs versus an extension to the byte stream registry. It’s also unclear exactly how a wasm decoder would be able to directly write into a JS object. Possibly the object can use ArrayBufferViews that can point into wasm memory.

I believe the information included here isn’t sufficient to be able to handle all codecs.

Here is the information that we end up needing to decode a raw frame in Gecko.

bool KeyFrame
bool EOS (used for vorbis content to determine if trimming is needed on the decoded output)
uint32_t discardPadding (used to indicate that audio is to be trimmed at the front)
uint64_t presentationTimestamp
uint64_t decodeTimestamp (required for MSE)
CryptoSample cryptoData
Size()
Data()
AlphaSize()
AlphaData()
ExtraData() -> required for H264 content with out of band SPS/PPS
TrackInfo() : either AudioInfo / VideoInfo ; this could be provided in something like an init segment.

For video, you need not just the visible size but also the picture size. Comparing Picture Size with Display Size lets you determine the aspect ratio.

VideoInfo:

DisplaySize
ImageSize ( include cropping info)
StereoMode
Rotation // describe how many degrees the decoded image must be rotated
ColorDepth
ColorSpace
ColorRange // full / limited
FrameRate
AlphaPresent
CodecSpecificConfig (array of bytes)

AudioInfo:

Rate
Channels
ChannelMap
Bit depth
profile
CodecSpecificConfig (array of bytes)

That’s just off the top of my head. There may be more. But it indicates that the proposed format doesn’t cater for all content.

Thanks Jean-Yves! For EOS, discardPadding, decodeTimestamp, and TrackInfo() I believe those could just be inferred during the append() operation. I.e.,

EOS = 0 size buffer.
discardPadding = appendWindow
decodeTimestamp = timestampOffset
TrackInfo = sourceBuffer configuration.

For audio, all of those fields would be limited to whatever can be extracted from a .wav container if we take this approach.

The most interesting fields I think you list are alpha, codec specific data, crypto, and display information related. Given these (and I’m sure more in the future), it seems the best approach here would be to standardize some JS structure that could be shared by WebCodecs and some new appendPackets([…]) and appendFrames([…]) methods for MSE.

Can you kindly explain the significance of the following code?

At Chromium when using a single SourceBuffer with mode set to "sequence" with input from MediaRecorder ("video/webm;codecs=vp8,opus") audio is 6 seconds faster than video, meaning audio concludes playback 6 seconds before video. AFAICT there is no programmatic means at the front-end to slow the audio playback to synchronize with the video playback.

Sequence mode with multiple tracks unfortunately doesn’t work well. I believe the last discussion we had on it was that it should be deprecated – I think you’re even on the bug already :slight_smile: Here’s the discussion on that: https://github.com/w3c/media-source/issues/186

The issue at Chromium is "segments" mode invariably stalls at a waiting event, particularly when the input buffers contain variable video track resolutions. Mozilla implementation does not have that problem.

Re

bytes 4-5    visible width in pixels
bytes 6-7    visible height in pixels

Even where the input frames’ visible width in pixels and visible height in pixels are set, if the values vary within a single SourceBuffer, Chromium, Chrome will not display the variable pixel width and pixel height frames. Unless it is clearly specified that output must match input, the result will be that implementers can arbitrarily decide to output only a single pixel width and pixel height for the entire track. I would suggest including language that mandates that, if variable resolution raw frames are input, the decoder and HTML <video> element MUST display the variable resolution frames. Otherwise, clearly specify that input MUST use only one value for pixel width and pixel height per SourceBuffer.

Will not

audio.timestampOffset = firstVideoPresentationTimestamp;

within updateend event throw an exception https://plnkr.co/edit/AevzbZ?p=preview ?

(index):403 Uncaught DOMException: Failed to set the 'timestampOffset' property on 'SourceBuffer': The timestamp offset may not be set while the SourceBuffer's append state is 'PARSING_MEDIA_SEGMENT'.

Unless I am missing a critical aspect of the code, setting timestampOffset outside of updateend event https://plnkr.co/edit/rUYflW?p=preview and using a single SourceBuffer are each broken at Chromium https://plnkr.co/edit/3gHmkq .

You have to set those fields before an append or remove operation and once started you must wait for the updateend before attempting to modify them again.
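
A minimal sketch of that ordering: wait until the SourceBuffer is idle, set timestampOffset, start the append, then wait for updateend before touching the buffer again. The helper name is hypothetical. Note that the timestampOffset setter can still throw if a previous append ended partway through a media segment (the PARSING_MEDIA_SEGMENT error quoted above), so the sketch turns any such exception into a rejected promise rather than trying to work around it.

```javascript
// Sketch only: sequences one append on an MSE SourceBuffer-like object.
// Resolves when the append's updateend fires; rejects if setting the
// offset or starting the append throws.
function appendWithOffset(sourceBuffer, data, timestampOffset) {
  return new Promise((resolve, reject) => {
    const start = () => {
      try {
        // Must happen while no append/remove is in progress.
        sourceBuffer.timestampOffset = timestampOffset;
        sourceBuffer.addEventListener('updateend', () => resolve(),
                                      { once: true });
        sourceBuffer.appendBuffer(data);
      } catch (error) {
        reject(error);
      }
    };
    if (sourceBuffer.updating) {
      // An operation is in flight: defer until it finishes.
      sourceBuffer.addEventListener('updateend', start, { once: true });
    } else {
      start();
    }
  });
}
```

Chaining calls, for example `appendWithOffset(audio, first, 0).then(() => appendWithOffset(audio, second, 10))`, keeps each offset change strictly between appends.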


The point is that the claim that "segments" mode of SourceBuffer works at Chromium, Chrome (https://bugs.chromium.org/p/chromium/issues/detail?id=983777#c25 ; https://bugs.chromium.org/p/chromium/issues/detail?id=992235; et al.) and is the appropriate setting to use for appending variable resolution frames to one or more SourceBuffers is, AFAICT, not demonstrated by any working example published, officially or informally, by the individuals who claim that "segments" mode will in fact output the expected result at Chromium, Chrome.

Meaning none of the examples “work”. If the assertion is that either of the examples will work after minor adjustments, then kindly adjust the code to demonstrate MediaSource "segments" mode working (synchronous video and audio playback to the completion of the input media, which should be 40-42 seconds given the input media fragments in the linked plnkrs) specifically at Chromium, Chrome at all (Mozilla browsers do not have the same issue).