[Proposal] hint attribute on HTMLMediaElement to configure rendering latency

All user agents delay the start of playback until some minimum number of frames are decoded and ready for rendering. This buffer provides an important cushion against playback stalls that might otherwise be caused by intermittent decoder slowness.

It also adds a small amount of latency to the start of playback. In Chromium, for example, this comes to roughly 200 milliseconds for common video frame rates.

Objective

Let web authors disable the internal rendering buffer for use cases where sub-second latency is more important than smooth playback. At the same time, un-ship Chromium’s heuristic-based approach that disables the rendering buffer (the heuristics are most often wrong).

Use cases

A good example is cloud gaming. When a user moves their mouse to control the camera or presses a button to jump over an obstacle, even sub-second latency can make the game unplayable. A similar use case would be remote desktop streaming.

Another example is camera streaming applications. Webcams may have highly variable frame rates with occasionally very long frame durations (when nothing about the picture has changed). Buffering several long-duration frames to start playback can add considerable latency to a use case where realtime information is important.

Discouraged uses

Most media streaming sites should prefer smooth playback over sub-second improvements in latency. This includes live streaming sports and TV, where low latency (e.g. 3 seconds) is very important, but sub-second latency is not critical and should not be prioritized over smooth playback.

This is also central to Chromium’s desire to move away from heuristic-based buffering behavior. Today, Chromium disables the buffer (currently just for video rendering) whenever the media metadata indicates that the duration is unknown/infinite. While this is true of camera streaming and cloud gaming, the vast majority of such content actually belongs to this discouraged-use category.

The proposed API aims to highlight this tradeoff and avoid misuse.

Proposed API

enum RenderingBufferHint { "none", "normal" };


partial interface HTMLMediaElement {
    attribute RenderingBufferHint renderingBufferHint = "normal"; 
};

Callers who change the value to “none” are encouraged to set the attribute before triggering the media load algorithm (e.g. before setting the src attribute). This maximizes the player’s opportunity to apply the setting to parts of the stack that cannot easily be changed once loading begins (e.g. OS audio buffer output size).

Example

<script>
  // Here MediaSource represents a cloud gaming stream or a stream from 
  // a security camera. Not shown: MSE buffer management.
  let mediaSource = new MediaSource();  
  video.renderingBufferHint = "none";
  video.src = window.URL.createObjectURL(mediaSource);
</script>
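
Since this is only a hint, pages can also feature-detect the attribute and fall back gracefully on user agents that have not implemented it. A minimal sketch, assuming the attribute ships on HTMLMediaElement exactly as proposed above:

<script>
  // Hypothetical feature detection for the proposed attribute; UAs without
  // it simply keep their default rendering buffer behavior.
  let mediaSource = new MediaSource();
  if ("renderingBufferHint" in HTMLMediaElement.prototype) {
    // Set the hint before assigning src so the whole load pipeline
    // (e.g. OS audio buffer sizing) can take it into account.
    video.renderingBufferHint = "none";
  }
  video.src = window.URL.createObjectURL(mediaSource);
</script>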


Wouldn’t another approach be to have an API to supply the first 500ms (say) of video and automatically switch over? (cf. lowsrc for images)

I do see your use case, and wonder also if it’d be better to say whether your application values accuracy over timeliness - a higher-level API than talking about buffers (see e.g. IPv6 QoS). The same sort of issue happens with video phone calls, where a preference for accuracy can lead to an unacceptable lag accumulating. The buffer isn’t the only part of the solution here - the ability to skip and resync is another, no?

Re: lowsrc, video render buffering doesn’t have the same “switch over” trigger. In both cases the aim is to start painting ASAP, but the video rendering use case wants to maintain that behavior over the lifetime of the element. It’s also detached from the quality of the video, which would be managed by the app in MSE cases.

Re: accuracy over timeliness, I’m definitely open to clarity in the naming of this. In our case the tradeoff is more timeliness over smoothness.

For sub-150ms ultra-low-latency applications such as gaming, this seems very useful, as it leverages the ability to process/separate the AV information from the raw streams coming from the gaming server while still meeting the latency requirements. Although WebRTC supports the latency requirements, it doesn’t allow any processing, insertion, or removal of the server stream before rendering it. That’s the situation in which this MSE spec addition becomes extremely useful.

We at Twitch would find this feature very useful, as we would like to be able to control buffer levels at the application level. This is an important feature as we approach even lower-latency streaming for interactive content.


Folks at Microsoft (both on the Microsoft Edge team and the Mixer team) are supportive of this proposal as well!

Hey group, sorry for the silence. After discussing more with @jernoble, I’d like to propose a different interface that is more high level.

enum AudioContentHint { "music", "speech", ... };
enum VideoContentHint { "realtime", ... };

partial interface HTMLMediaElement {
    attribute AudioContentHint audioContentHint = "";
};

partial interface HTMLVideoElement {
    attribute VideoContentHint videoContentHint = "";
};

The idea behind this is to be “descriptive” rather than “prescriptive”. The spec text for this would probably read similar to the MediaStreamTrack content hint proposal.

renderingBufferHint = “none” would instead be represented as videoContentHint = “realtime”. This signals that the video represents a real-time stream, possibly one that is interactive (e.g. cloud gaming, remote desktop). UAs would consider this signal to be a request to optimize for bare minimum sub-second latency.

For the audio hints “music” and “speech”, @jernoble gave the example of using the hint to select an optimal pitch/rate correction algorithm.
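
To make the mapping concrete, here is a rough sketch of how the earlier cloud-gaming example might look under this revised shape (assuming the attribute names and enum values sketched above, none of which are shipped):

<script>
  // Same setup as the earlier example: an MSE-backed interactive stream
  // (cloud gaming, remote desktop, etc.). Not shown: MSE buffer management.
  let mediaSource = new MediaSource();
  // "realtime" takes the place of renderingBufferHint = "none": it describes
  // the content, and the UA infers that minimal latency should win.
  video.videoContentHint = "realtime";
  // The audio hint is independent; e.g. a talk-show stream might use "speech"
  // so the UA can pick a suitable rate/pitch correction algorithm.
  // video.audioContentHint = "speech";
  video.src = window.URL.createObjectURL(mediaSource);
</script>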

Thoughts? Ideas for additional hint strings?

Hello, Chris @chcunningham

My name is Yueshi. I am a research engineer at Twitch. Nik (@npurushe) and I are colleagues.

I have been following this discussion for a little while and I would like to clarify a few things with you.

  1. There are 3 conflicting goals: a) minimum latency, b) smooth playback, and c) AV sync.

  2. The 200ms buffer we are talking about here is the buffer of decoded video and audio frames (i.e., video in YUV or RGB format, audio in PCM format), right? If so, how does MSE control the maximum number of YUV/RGB frames decoded before being rendered, since uncompressed video frames can consume a lot of memory?

  3. If the 200ms buffer contains decoded video frames, there are also decoded video frames held in the video decoder’s DPB (decoded picture buffer). For platforms with hardware decoding support, is the size of the DPB controlled by the hardware decoder, and are the decoded frames stored in GPU memory? Does that DPB size add extra latency on top of your 200ms buffer? For applications like cloud gaming, where the overall end-to-end latency budget is <150ms, how do they deal with the delay caused by the hardware decoder’s DPB?

  4. How can I map the new APIs you were proposing above to the user experience, for example

  • In order to achieve a) minimum latency, but not b) smooth playback or c) AV sync, how can I configure the video/audio hint?
  • In order to achieve b) smooth playback just for audio, but not a) minimum latency or c) AV sync, how can I configure the video/audio hint?
  • In order to achieve b) smooth playback just for audio and c) AV sync, but not a) minimum latency, how can I configure the video/audio hint?
  • In order to achieve b) smooth playback for both audio and video and c) AV sync, but not a) minimum latency, how can I configure the video/audio hint?

Nice to e-meet you here!