[Proposal] hint attribute on HTMLMediaElement to configure rendering latency

chcunningham · 2019-05-18

All user agents delay the start of playback until some minimum number of frames are decoded and ready for rendering. This buffer provides an important cushion against playback stalls that might otherwise be caused by intermittent decoder slowness.

It also adds a small amount of latency to start playback. For example, in Chromium this adds roughly 200 milliseconds for common video frame rates.

Objective

Let web authors disable the internal rendering buffer for use cases where sub-second latency is more important than smooth playback. At the same time, un-ship Chromium’s heuristic-based approach that disables the rendering buffer (the heuristics are most often wrong).

Use cases

A good example is cloud gaming. When a user moves their mouse to control the camera or presses a button to jump over an obstacle, even sub-second latency can make the game unplayable. A similar use case would be remote desktop streaming.

Another example is camera streaming applications. Webcams may have highly variable frame rates with occasionally very long frame durations (when nothing about the picture has changed). Buffering several long-duration frames to start playback can add considerable latency to a use case where realtime information is important.

Discouraged uses

Most media streaming sites should prefer smooth playback over sub-second improvements in latency. This includes live streaming sports and TV, where low latency (e.g. 3 seconds) is very important, but sub-second latency is not critical and should not be prioritized over smooth playback.

This is also central to Chromium’s desire to move away from heuristic-based buffering behavior. Today, Chromium disables the buffer (currently just for video rendering) whenever the media metadata indicates that duration is unknown/infinite. While this is true of camera streaming and cloud gaming, the vast majority of such content actually belongs to this discouraged-use category.

The proposed API aims to highlight this tradeoff and avoid misuse.

Proposed API

enum RenderingBufferHint { "none", "normal" };


partial interface HTMLMediaElement {
    attribute RenderingBufferHint renderingBufferHint = "normal"; 
};

Callers who change the value to “none” are encouraged to set the attribute before triggering the media load algorithm (e.g. before setting the src attribute). This maximizes the player’s opportunity to apply the setting to parts of the stack that cannot easily be changed once loading begins (e.g. OS audio buffer output size).

Example

<script>
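  // Here MediaSource represents a cloud gaming stream or a stream from
  // a security camera. Not shown: MSE buffer management.
  // Assumes the page contains a single <video> element.
  const video = document.querySelector("video");
  let mediaSource = new MediaSource();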
  video.renderingBufferHint = "none";
  video.src = window.URL.createObjectURL(mediaSource);
</script>
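
Because renderingBufferHint is only a proposal at this point, a page would probably want to feature-detect it before relying on it. The following is a minimal sketch under that assumption; applyRenderingBufferHint is a hypothetical helper name, not part of any API, and it assumes the attribute ships with the name and values proposed above.

```javascript
// Hypothetical helper: sets the proposed hint only if the user agent exposes it.
// Returns true when the hint was accepted, false when the page should fall back.
function applyRenderingBufferHint(media, hint) {
  if (!('renderingBufferHint' in media)) {
    return false; // attribute not implemented; default buffering stays in effect
  }
  media.renderingBufferHint = hint;
  // An enum attribute ignores unknown values, so read the value back to confirm.
  return media.renderingBufferHint === hint;
}
```

Calling applyRenderingBufferHint(video, "none") before assigning video.src keeps the ordering recommended above, and a false return value tells the site it is still getting the default buffer.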

Links

explainer (pasted above)
whatwg issue 4638

liamquin · 2019-05-18

Wouldn’t another approach be to have an API to supply the first 500ms (say) of video and automatically switch over? (cf. lowsrc for images)

I do see your use case, and wonder also if it’d be better to say whether your application values accuracy over timeliness - a higher level API than talking about buffers (see e.g. IPv6 QoS). The same sort of issue happens with video ’phone calls, where a preference for accuracy can lead to an unacceptable lag accumulating. The buffer isn’t the only part of the solution here - the ability to skip and resync is another, no?

chcunningham · 2019-05-30

Re: lowsrc, video render buffering doesn’t have the same “switch over” trigger. In both cases the aim is to start painting ASAP, but the video rendering use case wants to maintain that behavior over the lifetime of the element. It’s also detached from the quality of the video, which would be managed by the app in MSE cases.

Re: accuracy over timeliness, I’m definitely open to clarity in the naming of this. In our case the tradeoff is more timeliness over smoothness.

fernando-80 · 2019-06-24

For sub-150ms ultra-low-latency applications such as gaming, this seems very useful, as it makes it possible to process/separate the AV information from the raw streams coming from the gaming server and still meet the latency requirements. Although WebRTC supports the latency requirements, it doesn’t allow any processing of the server stream (insertion or removal) before rendering it. That’s the situation in which this MSE spec addition becomes extremely useful.

npurushe · 2019-07-05

We at Twitch would find this feature very useful, as we would like to be able to control buffer levels at the application level. This is an important feature as we approach even lower latency streaming for interactive content.

scottlow · 2019-07-15

Folks at Microsoft (both on the Microsoft Edge team and the Mixer team) are supportive of this proposal as well!

chcunningham · 2019-09-05

Hey group, sorry for the silence. After discussing more with @jernoble, I’d like to propose a different interface that is more high level.

enum AudioContentHint { "music", "speech", ... };
enum VideoContentHint { "realtime", ... };

partial interface HTMLMediaElement {
    attribute AudioContentHint audioContentHint = "";
};

partial interface HTMLVideoElement {
    attribute VideoContentHint videoContentHint = "";
};

The idea behind this is to be “descriptive” rather than “prescriptive”. The spec text for this would probably read similar to the MediaStreamTrack content hint proposal.

renderingBufferHint = “none” would instead be represented as videoContentHint = “realtime”. This signals that video represents a real-time stream, possibly one that is interactive (e.g. cloud gaming, remote desktop). UAs would consider this signal to be a request to optimize for bare minimum sub-second latency.

For the audio hints “music” and “speech”, @jernoble gave the example of using the hint to select an optimal pitch rate correction algorithm.

Thoughts? Ideas for additional hint strings?

yshen · 2019-09-12

Hello, Chris @chcunningham

My name is Yueshi. I am a research engineer at Twitch. Nik (@npurushe) and I are colleagues.

I have been following this discussion for a little while and I would like to clarify a few things with you.

There are 3 conflicting goals a) minimum latency, b) smooth playback, and c) AV sync.
The 200ms buffer we are talking about here is the buffer of decoded video and audio frames (i.e., video in YUV or RGB format, audio in PCM format), right? If so, how does MSE control the maximum number of YUV/RGB frames decoded before being rendered, since uncompressed video frames can consume a lot of memory?
If the 200ms buffer contains decoded video frames, there are also decoded video frames held in the video decoder’s DPB (decoded picture buffer). For platforms having hardware decoding support, is the size of the DPB controlled by the hardware decoder and are the decoded frames stored in GPU memory? Does that DPB size add extra latency on top of your 200ms buffer? For applications like Cloud Gaming, the overall end-to-end latency budget is <150ms, how do they deal with the delay caused by hardware decoder’s DPB?
How can I map the new APIs you were proposing above to the user experience, for example

In order to achieve a) minimum latency, but not b) smooth playback or c) AV sync, how can I configure the video/audio hint?
In order to achieve b) smooth playback just for audio, but not a) minimum latency or c) AV sync, how can I configure the video/audio hint?
In order to achieve b) smooth playback just for audio and c) AV sync, but not a) minimum latency, how can I configure the video/audio hint?
In order to achieve b) smooth playback for both audio and video and c) AV sync, but not a) minimum latency, how can I configure the video/audio hint?

Nice to e-meet you here!

chcunningham · 2019-09-22

Hi Yueshi, apologies for the delay. Nice to meet you as well. Great questions!

FYI, we just discussed this at TPAC and resolved to go back to a more narrow latency hint like I originally proposed. Safari also noted either latency proposal is difficult for them to implement given their limited control over the platform media player library.

There are 3 conflicting goals a) minimum latency, b) smooth playback, and c) AV sync.

Latency and smoothness are definitely competing. Starting with lower latency increases the odds of underflow (decoded media not ready in time to be rendered).

AV sync is more nuanced - it may be affected, but this is not prescribed by any spec. Each UA has slightly different behavior to cope with underflow. In Chrome, the strategy is as follows for underflows caused by slow decoders as well as by a slow network:

Say you have audio+video in a single <video> tag. If just audio underflows, chrome will immediately pause playback. AV sync is never lost, but the interruption is clearly noticeable. If just video underflows, chrome will let audio keep playing for 3 seconds (breaking av sync). Most often video catches back up (the user may not even notice), but after 3 seconds chrome will pause both tracks to rebuffer.
Say you instead have separate <audio> and <video> tags. Either tag can underflow without affecting the other. But here you have to manage AV sync entirely on your own (polling currentTime and making playbackRate adjustments to re-sync as needed). I’ll refer to this option as decoupled underflow in the text below.
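
For the separate-tag case, the manual sync loop could look roughly like the sketch below. It is only an illustration: syncPlaybackRate and startSyncLoop are hypothetical helpers, and the tolerance, maximum adjustment and polling interval are assumed values that a real player would tune.

```javascript
// Hypothetical helper: picks a playbackRate for the <video> element so that it
// converges on the <audio> element's position. All thresholds are assumptions.
function syncPlaybackRate(audioTime, videoTime, options = {}) {
  const tolerance = options.tolerance ?? 0.04; // seconds of drift to ignore
  const maxAdjust = options.maxAdjust ?? 0.1;  // never deviate more than 10%
  const drift = audioTime - videoTime;         // positive: video is behind
  if (Math.abs(drift) <= tolerance) {
    return 1.0;
  }
  // Scale the correction with the drift, clamped to the allowed range.
  const adjust = Math.max(-maxAdjust, Math.min(maxAdjust, drift));
  return 1.0 + adjust;
}

// Poll both elements and nudge the video toward the audio clock.
// Returns the interval id so the caller can stop it with clearInterval.
function startSyncLoop(audio, video, intervalMs = 250) {
  return setInterval(() => {
    video.playbackRate = syncPlaybackRate(audio.currentTime, video.currentTime);
  }, intervalMs);
}
```

Treating the audio element as the reference clock is a design choice made here because rate changes on video tend to be less noticeable than rate changes on audio.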

Note again, the above is Chrome specific. Mozilla did express interest in standardizing underflow behavior between browsers. Very early stage discussion.

The 200ms buffer we are talking about here is the buffer of decoded video and audio frames (i.e., video in YUV or RGB format, audio in PCM format), right?

Correct. The 200ms number actually comes from Chrome’s decoded audio frame buffer. The video frame buffer is ~3 frames, so generally the shorter pole for common framerates.
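
To make the comparison concrete, here is a small sketch that converts the ~3-frame video buffer into milliseconds at a few common frame rates. The constants are the Chrome-specific figures quoted in this thread; videoBufferMs is a hypothetical helper and assumes a constant frame rate.

```javascript
const AUDIO_BUFFER_MS = 200;   // Chrome's decoded audio buffer, per this thread
const VIDEO_BUFFER_FRAMES = 3; // approximate decoded video frame buffer

// Duration covered by the video frame buffer at a constant frame rate.
function videoBufferMs(framesPerSecond, frames = VIDEO_BUFFER_FRAMES) {
  return (frames * 1000) / framesPerSecond;
}

for (const fps of [24, 30, 60]) {
  const videoMs = videoBufferMs(fps).toFixed(1);
  console.log(`${fps} fps: video ${videoMs} ms vs audio ${AUDIO_BUFFER_MS} ms`);
}
```

At 24, 30 and 60 fps the video buffer covers 125, 100 and 50 ms respectively, all below the 200 ms audio buffer.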

If so, how does MSE control the maximum number of YUV/RGB frames decoded before being rendered, since uncompressed video frames can consume a lot of memory?

For now the size of the decoded frame buffer is up to the UA (each does it differently). MSE has no control.

With this proposal, setting renderingBufferHint = "none" would cause us to start playback as soon as we have a single frame of video and just enough audio to fill the platform playout buffer without glitching.

This proposal does not include a way to hint at a larger / custom buffering size. Perhaps it should. It may be that sites want something more than the bare minimum, but not as much as the default. Or perhaps they even want more than the default. I’ll give this some thought - lmk if it interests you. In the end, it’s still a hint - the UA would get final say since some low values are impossible and some high values would require too much memory.

For platforms having hardware decoding support, is the size of the DPB controlled by the hardware decoder

No. The DPB is after the decoder.

DPB aside, some hw decoders may increase latency if they require several inputs to produce the first output. I’d have to double check how prevalent that is.

are the decoded frames stored in GPU memory?

Often yes, but this shouldn’t affect startup latency.

For applications like Cloud Gaming, the overall end-to-end latency budget is <150ms, how do they deal with the delay caused by hardware decoder’s DPB?

At present (having not shipped this proposal), it’s pretty tricky and fairly UA specific. You can put Chrome’s video renderer in low-delay mode by using MSE to append a video that has unknown duration in its container metadata. You can work around the audio renderer buffer by using WebAudio. You can also abandon MSE for WebRTC, but this requires significant changes for the site and media server.

How can I map the new APIs you were proposing above to the user experience…

Let me reorder and lightly edit your questions to build up from the default case.

In order to achieve b) smooth playback for both audio and video and c) AV sync, but not a) minimum latency…

This is the default today without a hint. Simply combine audio+video into a single <video> tag. The UA will manage AV sync and the default buffering size will optimize for smoothness instead of latency (assuming you’re not triggering chrome’s existing low-delay mode heuristic).

In order to achieve a) minimum latency, but not b) smooth playback nor c) AV sync, how can I configure the video/audio hint?

Use separate <audio> <video> tags. Set renderingBufferHint = "none" on both. No UA will attempt to manage av sync for the separate tags.

As mentioned before, separating the tags will decouple underflow behavior. But this also means underflow can really wreck AV sync if you don’t make manual adjustments. See next question…

How can I achieve minimum latency (less smoothness) for audio and video and still keep AV sync?

The easy route is to combine audio+video into a single <video> tag and set .renderingBufferHint = “none”. Playback will start as soon as both tracks have a minimum amount of data decoded and the UA will manage AV sync for you. BUT, the tracks will also have coupled underflow behavior.

If you want de-coupled underflow behavior, you would have to use separate tags and do AV sync manually with initial (and periodic) playbackRate adjustments to catch up a track that falls behind.

Can I achieve minimum latency for just video?

Yes. Again use separate tags, but only set the hint on video. Use playbackRate adjustments if you want AV sync.
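
The answers above can be summarized as a small decision function. This is just a sketch of the mapping described in this post; planPlayback and its option names are hypothetical, and the returned object is a description for the site's own player code rather than anything a browser consumes.

```javascript
// Hypothetical planner: maps the goals discussed above to an element layout and
// to the elements that should carry the proposed renderingBufferHint = "none".
//   lowLatency: 'none' | 'video' | 'both'
function planPlayback({ lowLatency = 'none', avSync = true, decoupledUnderflow = false } = {}) {
  if (lowLatency === 'none') {
    // Default behavior: one element, UA-managed sync, no hint.
    return { elements: 'single', hintOn: [], manualSync: false };
  }
  if (lowLatency === 'video') {
    // Minimum latency for video only: separate tags, hint on the video tag.
    return { elements: 'separate', hintOn: ['video'], manualSync: avSync };
  }
  if (avSync && !decoupledUnderflow) {
    // Easy route: a single <video> tag with the hint; underflow stays coupled.
    return { elements: 'single', hintOn: ['video'], manualSync: false };
  }
  // Separate tags with the hint on both; sync via playbackRate if wanted.
  return { elements: 'separate', hintOn: ['audio', 'video'], manualSync: avSync };
}
```

When manualSync is true, the site is expected to poll currentTime and adjust playbackRate itself, as described earlier for decoupled underflow.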

Chris

yshen · 2019-09-23

Thank you very much, Chris. Love your feedback and enjoy our deep technical discussion.

I think I understand and agree with almost everything you wrote, and I have a few follow-up comments:

Let’s synchronize on our understanding about the general playback pipeline and its latency.

Here is a diagram of the playback pipeline, and our goal is to reduce the “Total A/V processing latency”. Is my understanding correct?

May we clarify our understandings on DPB?

[YS] For platforms having hardware decoding support, is the size of the DPB controlled by the hardware decoder

[CC] No. The DPB is after the decoder.

DPB aside, some hw decoders may increase latency if they require several inputs to produce the first output. I’d have to double check how prevalent that is.

[YS] are the decoded frames stored in GPU memory?

[CC] Often yes, but this shouldn’t affect startup latency.

[YS] The 200ms buffer we are talking about here is the buffer of decoded video and audio frames (i.e., video in YUV or RGB format, audio in PCM format), right?

[CC] Correct. The 200ms number actually comes from Chrome’s decoded audio frame buffer. The video frame buffer is ~3 frames, so generally the shorter pole for common framerates.

It sounds like you thought I was using DPB to reference the component 5.2 (the “~3 frames” in your words). Actually, I meant the number of the decoded frames (used as the reference frames for decoding later frames) held by the low-level decoder (i.e., component 4.2 in the above diagram, which you talked about with “some hw decoders may increase latency if they require several inputs to produce the first output”).

I found an interesting article that describes the DPB management in low-level H.264 decoder https://www.vcodex.com/h264avc-picture-management/

With the current MSE behavior, what is the minimum total A/V processing latency?

As long as the audio PCM buffer fills to 200ms, the playback will start. Does MSE player check whether there are indeed at least 3 video frames in the video YUV buffer (component 5.2)? If not, should the 200ms be larger than the delay of both DPB (component 4.2) and video YUV buffer (component 5.2), otherwise the video might not be ready before the audio playback starts?

With the current MSE behavior, what will happen if the input of the pipeline pauses temporarily, i.e., the audio PCM buffer (component 5.1) drains and refills?

For the cloud gaming use case, after rebuffer, I guess the audio stream will be programmed to skip to the present time (to avoid accumulating extra latency). However, there are still old decoded frames buffered in the low-level video decoder (component 4.2). What is the mechanism for handling these obsolete video frames: still display them or discard them?

Ultimately, the size of DPB affects video playback’s smoothness. If the DPB size is indeed large (e.g., 10 frames, i.e., 167ms), there is only a 33ms buffer left for video to be rendered at a regular pace.

With your new proposal, now the audio PCM buffer (component 5.1) becomes tiny, what is the mechanism to absorb the network jitter or the fluctuation of video frame size (i.e., video frames’ arrival time is not regular) to achieve smooth video playback?

I think the question really is: what buffer MSE can offer if it’s not a fixed 200ms one? Can it be an elastic one (like WebRTC Media has implemented) based on the frames’ arrival time intervals? If we just remove this buffer completely, will we still face the delay brought by the DPB, as well as the new stuttering of video playback due to network jitter or frame size difference (e.g., I-frame a lot larger than P-frame).

chcunningham · 2019-09-25

Happy to discuss! Quick note: all that follows is chrome specific.

Let’s synchronize on our understanding about the general playback pipeline and its latency.

Flow of the data looks good. Just some thoughts on latency for the different components

SourceBuffer (1): I haven’t personally tested this, but I expect the source buffer itself shouldn’t add much latency unless the main thread is busy doing lots of other work (which I guess is possible on sites like youtube, twitch, etc where you have comments feeds and lots of other things to render). But maybe you’re just noting the network variability and/or the small buffer you build up with each append chunk - both definitely contribute latency.

Audio/video compressed frame buffer (3.1, 3.2): we don’t keep a buffer of compressed frames in a way that would add any latency. Some demuxers keep an internal buffer, but this is just an optimization to have it ready in advance of being asked for it - we don’t block anything on getting those buffers to a certain capacity. Generally any read from the demuxer is satisfied in about 0ms unless the demuxer runs out of data (for MSE this would mean the app failed to append in time).

VideoDecoder (4.2): we do have a re-order (DPB) queue for AVC, but this only adds latency for streams that use re-ordering (B frames). We recommend not using B-frames in any latency sensitive app. Also, I checked with Chrome decoder folks and confirmed that none of the modern video codecs have the requirement I mentioned of needing more than one frame to produce a first output. So the only unavoidable video decoder latency is from the decode itself. This will vary depending on the system, but it’s obviously < 16.6666ms for any computer that can otherwise handle 60fps. If you find the machine is dropping lots of frames (can’t handle), adapt down to a lower resolution/framerate and resort to playbackRate adjustments to catch them back up to the live edge.

May we clarify our understandings on DPB?

You’re right, I was confused earlier. See comments on VideoDecoder above.

With the current MSE behavior, what is the minimum total A/V processing latency?

If you combine audio+video into a tag (e.g. MSE w/ 2 SourceBuffers) we will block playback until both tracks have met their buffer requirements (200ms for audio, ~3 frames for video). This avoids the pitfall you mentioned where you start with only one track ready.

Also, note that this isn’t MSE specific. Our video pipeline is the same for MSE and src=file.webm type playbacks - the only difference being the demuxer.

With the current MSE behavior, what will happen if the input of the pipeline pauses temporarily, i.e., the audio PCM buffer (component 5.1) drains and refills?

We call this an “underflow”. If the element has both audio+video tracks, we would immediately freeze both, emit the “waiting” event, and .readyState would change to HAVE_CURRENT_DATA. The element isn’t officially “paused”, but is instead rebuffering and will automatically resume when the queue fills back up (i.e. UI would show a spinner, not a pause icon). See my earlier post for other details on de-coupled vs coupled underflow.
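
A site can observe these rebuffering episodes from script. The sketch below counts them using the standard “waiting” and “playing” events; monitorUnderflow is a hypothetical helper, and note that “waiting” can also fire for other reasons (for example during a seek), so the count is only an approximation of true underflows.

```javascript
const HAVE_CURRENT_DATA = 2; // numeric value of HTMLMediaElement.HAVE_CURRENT_DATA

// Counts rebuffering episodes on a media element (or any object exposing
// addEventListener and readyState). Returns a live stats object.
function monitorUnderflow(media) {
  const stats = { underflows: 0, rebuffering: false };
  media.addEventListener('waiting', () => {
    // Playback stopped for lack of data for the next frame.
    if (media.readyState <= HAVE_CURRENT_DATA && !stats.rebuffering) {
      stats.rebuffering = true;
      stats.underflows += 1;
    }
  });
  media.addEventListener('playing', () => {
    // Fired when playback resumes after being delayed for lack of data.
    stats.rebuffering = false;
  });
  return stats;
}
```

The returned stats object could feed a player's decision on how much cushion to keep ahead of currentTime.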

However, there are still old decoded frames buffered in the low-level video decoder (component 4.2). What is the mechanism for handling these obsolete video frames: still display them or discard them?

When data arrives to resolve the underflow we will resume playing from the position we froze at. You can fast fwd to current time with a playbackRate adjustment, or if a lot of time has passed you can issue a seek by setting .currentTime to the live media time. Will require experimentation to determine a threshold for which approach to use.
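
As a starting point for that experimentation, the choice between the two approaches might be sketched as follows. chooseCatchUp and applyCatchUp are hypothetical helpers, and targetLag, seekThreshold and fastRate are placeholder values, not recommendations.

```javascript
// Hypothetical policy: returns the action a player might take to get back to
// the live edge. All thresholds are placeholders to be tuned by experiment.
function chooseCatchUp(currentTime, liveTime, options = {}) {
  const targetLag = options.targetLag ?? 0.2;       // acceptable lag in seconds
  const seekThreshold = options.seekThreshold ?? 2; // beyond this lag, seek
  const fastRate = options.fastRate ?? 1.1;         // temporary speed-up rate
  const lag = liveTime - currentTime;
  if (lag <= targetLag) {
    return { action: 'none', playbackRate: 1.0 };
  }
  if (lag > seekThreshold) {
    // The post suggests seeking to the live media time; a player may prefer
    // to land slightly behind it to keep a small cushion.
    return { action: 'seek', seekTo: liveTime };
  }
  return { action: 'speed-up', playbackRate: fastRate };
}

// Applies the decision to a media element and returns it for logging.
function applyCatchUp(media, liveTime, options) {
  const decision = chooseCatchUp(media.currentTime, liveTime, options);
  if (decision.action === 'seek') {
    media.currentTime = decision.seekTo;
    media.playbackRate = 1.0;
  } else {
    media.playbackRate = decision.playbackRate;
  }
  return decision;
}
```

Calling applyCatchUp periodically (for example from a timeupdate handler) returns the rate to 1.0 once the lag drops back under targetLag.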

Ultimately, the size of DPB affects video playback’s smoothness. If the DPB size is indeed large (e.g., 10 frames, i.e., 167ms), there is only a 33ms buffer left for video to be rendered at a regular pace.

DPB size will be zero if you disable B-frames.

With your new proposal, now the audio PCM buffer (component 5.1) becomes tiny, what is the mechanism to absorb the network jitter or the fluctuation of video frame size (i.e., video frames’ arrival time is not regular) to achieve smooth video playback? I think the question really is: what buffer MSE can offer if it’s not a fixed 200ms one? Can it be an elastic one (like WebRTC Media has implemented) based on the frames’ arrival time intervals?

This proposal will only reduce the amount of buffer we require to start playback. We will still keep reading from the demuxer as long as there is data in an attempt to fill the 200ms/3-frame queues after playback begins. It’s fine if the site never appends enough for those queues to fill up (say you only ever append 100ms beyond currentTime, which is still marching forward) - it obviously increases the risk of underflow, but now the site has control over how much risk they’re taking.
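
To illustrate how a site could control that risk, the sketch below measures how much media is buffered ahead of the playhead and decides whether to append more. bufferedAhead and shouldAppendMore are hypothetical helpers; the 100 ms default simply mirrors the example figure above. The buffered argument is any TimeRanges-like object (length, start(i), end(i)), such as media.buffered.

```javascript
// Seconds of media buffered ahead of the playhead.
function bufferedAhead(buffered, currentTime) {
  for (let i = 0; i < buffered.length; i++) {
    if (buffered.start(i) <= currentTime && currentTime <= buffered.end(i)) {
      return buffered.end(i) - currentTime;
    }
  }
  return 0; // playhead is outside every buffered range
}

// Hypothetical site policy: keep appending until the chosen cushion is reached.
function shouldAppendMore(buffered, currentTime, targetAheadSeconds = 0.1) {
  return bufferedAhead(buffered, currentTime) < targetAheadSeconds;
}
```

A smaller targetAheadSeconds means lower latency and a higher chance of underflow, which is exactly the tradeoff the site now owns.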

I’m not really familiar with WebRTC’s model. Can you send me some reading material?

yshen · 2019-09-27

Hey, Chris

This discussion is really great. I learned so much from it. By the way, are you attending FOMS? If so, we can meet in person at FOMS.

I think our understandings are converging quickly and may I have a few more comments:

Say you have audio+video in a single <video> tag. If just audio underflows, chrome will immediately pause playback. AV sync is never lost, but the interruption is clearly noticeable. If just video underflows, chrome will let audio keep playing for 3 seconds (breaking av sync). Most often video catches back up (the user may not even notice), but after 3 seconds chrome will pause both tracks to rebuffer.

If you combine audio+video into a tag (e.g. MSE w/ 2 SourceBuffers) we will block playback until both tracks have met their buffer requirements (200ms for audio, ~3 frames for video). This avoids the pitfall you mentioned where you start with only one track ready.

So the minimum latency after startup is max(200ms, the DPB size + 3 video frames), and is 200ms for the non-B frame case because the DPB size is so small.
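
As a worked form of that expression, here is a small sketch that treats the DPB size as a number of re-ordered frames and uses the Chrome-specific figures from this thread (200 ms of audio, ~3 video frames). minStartupLatencyMs is a hypothetical helper and assumes a constant frame rate.

```javascript
// max(audio buffer, (re-order depth + video frame buffer) * frame duration)
function minStartupLatencyMs({ fps, reorderFrames = 0, audioMs = 200, videoFrames = 3 }) {
  const frameMs = 1000 / fps;
  const videoMs = (reorderFrames + videoFrames) * frameMs;
  return Math.max(audioMs, videoMs);
}

// No re-ordering at 60 fps: video needs 50 ms, so audio dominates.
console.log(minStartupLatencyMs({ fps: 60 }));
// Ten re-ordered frames at 60 fps: video needs about 216.7 ms and dominates.
console.log(minStartupLatencyMs({ fps: 60, reorderFrames: 10 }).toFixed(1));
```

With no re-ordering the audio buffer sets the floor; the video side only takes over once the re-order depth pushes it past 200 ms.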

We recommend not using B-frames in any latency sensitive app.

How do the hardware decoders detect that a bitstream doesn’t (and won’t) have B-frames? Do they do it based on some flag in SPS or profile? Dynamic GOP structure is quite common in encoders, e.g., they will choose IPPP when the video content has a lot of motion (i.e., temporal prediction doesn’t work too well) and choose IPBB or even pyramid-B (which has even longer reorder delay) when the scene becomes more static.

You can fast fwd to current time with a playbackRate adjustment, or if a lot of time has passed you can issue a seek by setting .currentTime to the live media time.

After the playback starts or resumes, a play algorithm can increase the playback speed skillfully to reduce the latency to less than 200ms, is it a good idea? Alternatively, the algorithm can skip forward to reduce the latency. Does skipping forward work at this granularity (e.g., seek forward for 30ms)?

This proposal will only reduce the amount of buffer we require to start playback.

What is the minimum audio buffer size and minimum video buffer size in the new proposal? Do you plan to make both of them 0 (basically as soon as an audio/video frame comes in, it will be rendered)?

If that’s the case, the MSE doesn’t really force any buffer to absorb the network and video frame size fluctuation, and it will be the player algorithm’s responsibility to adjust playback speed or seek forward in order to maintain a reasonable buffer size which is a good tradeoff between latency and buffering rate. Is my understanding correct?

WebRTC has a mechanism to delay or speed up the decoding time of a video frame, which basically tries to solve the same problem (https://webrtc.googlesource.com/src/+/refs/heads/master/modules/video_coding/jitter_estimator.cc#117). I need to do more study and will provide more details.

cwilso · 2019-09-27

Repository created for incubation: https://github.com/WICG/media-latency-hint

chcunningham · 2019-10-01

Repository created for incubation: https://github.com/WICG/media-latency-hint

Thank you! I’ll upload an explainer in the coming days.

I’d like to propose another revision to the API shape. Here’s some IDL:

partial interface HTMLMediaElement {
    // A hint in (partial) seconds for how long to buffer before 
    // beginning playback (both initially and after a seek).
    attribute double latencyHint;
};

Seconds is a strange unit (no one is expected to want several seconds of latency), but this is the unit elsewhere in HTMLMediaElement and is still precise enough for our purposes here.

Compared to the initial proposal, the equivalent of renderingBufferHint = “none” would now be latencyHint = 0. Obviously we can’t achieve perfect zero, but this is a valid way for a site to express “bare minimum”.

Providing no hint would imply we maintain the current default behavior, optimizing for playback smoothness by buffering a UA-specific amount (chrome ~200msec) before starting playback.

The power of this proposal is the in-between states. Sites may want less than the default but more than the bare minimum (e.g. latencyHint = 0.1, or 100 msec). This is actually easier for implementers as well. With the previous proposal (could only express “bare minimum”), we know the risk of underflow from decoder hiccups would be higher, but we (implementers) are forced to choose how close to zero is still usable. Letting sites choose saves us this headache and lets sites make site-specific decisions.

Some sites like traditional video streaming sites may even want more than the default. They could set latencyHint = 0.5 or even 1.0. As the name suggests, it’s still a hint. UAs will ultimately decide a maximum buffer size to avoid using crazy amounts of memory. Values like these would suggest that the UA should exceed its default and do everything possible to trade latency for smoother playback.
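
From the page's side, using the revised attribute might look like the sketch below. It assumes latencyHint ships with the name and unit (seconds) shown in the IDL above; applyLatencyHint is a hypothetical helper that takes milliseconds for convenience and feature-detects the attribute.

```javascript
// Hypothetical helper: applies the proposed latencyHint when the user agent
// exposes it. Returns false when the UA keeps its default buffering.
function applyLatencyHint(media, milliseconds) {
  if (!Number.isFinite(milliseconds) || milliseconds < 0) {
    throw new RangeError('latency hint must be a non-negative finite number');
  }
  if (!('latencyHint' in media)) {
    return false; // attribute not implemented
  }
  media.latencyHint = milliseconds / 1000; // the proposal uses seconds
  return true;
}
```

For example, applyLatencyHint(video, 0) would express “bare minimum”, applyLatencyHint(video, 100) an in-between value, and applyLatencyHint(video, 500) a request for more than the default.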

Yueshi and I spoke offline. We discussed this latest proposal and his questions from earlier. Summarizing my answers below for folks reading along.

How do the hardware decoders detect that a bitstream doesn’t (and won’t) have B-frames? Do they do it based on some flag in SPS or profile? Dynamic GOP structure is quite common in encoders, e.g., they will choose IPPP when the video content has a lot of motion (i.e., temporal prediction doesn’t work too well) and choose IPBB or even pyramid-B (which has even longer reorder delay) when the scene becomes more static.

First a correction: We manage the DPB for most of our hw decoders. The notable exceptions are: DXVA decoder on windows and MediaCodec on Android. For DXVA, we do always enable CODECAPI_AVLowLatencyMode. Also, we are slowly ramping up a new D3D11 accelerator on windows, a lower level API where we again manage the DPB. D3D11 support will vary.

And another correction: While streams can avoid a deep DPB, I’ve learned it’s not simply a matter of avoiding b-frames. Instead, there are flags in the bitstream to tell the decoder you won’t re-order (or how restricted your re-ordering will be). List of options:

1. use VUI to set max number of re-order frames
2. choose AVC profiles that don’t re-order
3. set mmco5 on frames to signal when a flush of DPB is desired

There may be other ways, this is just what I’m aware of so far. Of these, #1 and #2 may be the most broadly supported (implemented here for our mac decoder, and here for chromeos + d3d11). #3 is implemented on mac - didn’t see it in the other file (but we could add it).

Apologies for the misdirections earlier (and thank you for asking great questions).

After the playback starts or resumes, a play algorithm can increase the playback speed skillfully to reduce the latency to less than 200ms, is it a good idea? Alternatively, the algorithm can skip forward to reduce the latency. Does skipping forward work at this granularity (e.g., seek forward for 30ms)?

Temporarily increasing the playback speed should work well. Users may notice depending on the values used. Also, you can over-do it such that you run up against the edge of what you have buffered. It will be up to the site to decide how close to the edge it should go (probably involves some bandwidth monitoring etc).

Seeking may be a good idea if the player needs to skip over a large section - it may be less noticeable to the user than a prolonged playback rate adjustment. Seeking over very short intervals may not perform as well as a quick playback rate adjustment (note: seeking always pre-rolls the decoder from an earlier I-frame - we throw away frames up to the seek point, but we still have to decode them). Seeking will also be necessary if you need to skip over a gap in the video’s buffered range (say some segment failed to download).
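
For the gap case, a player could look for the next buffered range ahead of the playhead and seek to its start. The sketch below shows one way to find that target; gapSeekTarget is a hypothetical helper, the 10-second limit is an assumed safety bound, and buffered is any TimeRanges-like object (length, start(i), end(i)).

```javascript
// Returns the start of the next buffered range when the playhead sits in a
// gap (or at the very end of a range), or null when no jump is needed.
function gapSeekTarget(buffered, currentTime, maxGapSeconds = 10) {
  for (let i = 0; i < buffered.length; i++) {
    const start = buffered.start(i);
    if (start <= currentTime && currentTime < buffered.end(i)) {
      return null; // playhead is inside buffered media
    }
    if (start > currentTime) {
      // Ranges are ordered, so this is the nearest range ahead of the playhead.
      return start - currentTime <= maxGapSeconds ? start : null;
    }
  }
  return null; // nothing buffered ahead of the playhead
}
```

A non-null result would then be assigned to media.currentTime, typically from a handler that runs when playback stalls.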

What is the minimum audio buffer size and minimum video buffer size in the new proposal? Do you plan to make both of them 0 (basically as soon as an audio/video frame comes in, it will be rendered)?

For video alone we could start rendering as soon as the first frame is ready. We would render any later frame once its presentation time had been reached. For Audio or Audio+Video, we need to buffer some audio to deliver chunks to the platform audio render buffer. For Windows I believe the platform buffer can be configured to as small as 10ms, so we might buffer around 2 - 3x that to reliably avoid audio glitches. This requires some experimenting on our part - will let you know.

If that’s the case, the MSE doesn’t really force any buffer to absorb the network and video frame size fluctuation, and it will be the player algorithm’s responsibility to adjust playback speed or seek forward in order to maintain a reasonable buffer size which is a good tradeoff between latency and buffering rate. Is my understanding correct?

Correct. The site’s player alone decides how much to buffer ahead in MSE and can use playback rate and seeking to catch up if things fall behind.

yshen · 2019-10-01

Thank you very much, Chris!

I second the proposal of

partial interface HTMLMediaElement {
    attribute double latencyHint;
};

, which allows content platform/player developers to choose their preferred latency/buffering rate tradeoff depending on their user experience choice, backend (transcoder) behavior, and any particular user’s network condition.

kuddai · 2019-10-21

Hello, as far as I can see, Mixer uses WebRTC for streaming. As the closest proxy, would you be interested in this similar Blink API/experiment? https://bit.ly/2P6R2SK

It offers a similar hint, playoutDelayHint, on the WebRTC RTCRtpReceiver object:

partial interface RTCRtpReceiver {
  attribute double? playoutDelayHint;
};

which allows you to influence the size of audio/video buffers. The default preference in WebRTC is to render audio/video as fast as possible when the network connection is good, which makes it low latency and suitable for realtime communication. However, this also makes it more sensitive to any sort of network jitter, in contrast to standard video playout, where it is not rare to buffer content several seconds in advance. For streaming, adding an extra 200~500 milliseconds of buffering probably wouldn’t hurt the realtime part of the experience but would make quality more resilient to network issues.
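
A minimal sketch of applying this hint to a connection follows. It assumes the experimental playoutDelayHint attribute described above (in seconds) is available and feature-detects it per receiver; setPlayoutDelayHint is a hypothetical helper name.

```javascript
// Sets the experimental playoutDelayHint (seconds) on every receiver that
// exposes it; returns how many receivers accepted the hint.
function setPlayoutDelayHint(peerConnection, seconds) {
  let applied = 0;
  for (const receiver of peerConnection.getReceivers()) {
    if ('playoutDelayHint' in receiver) {
      receiver.playoutDelayHint = seconds;
      applied += 1;
    }
  }
  return applied;
}
```

For the streaming case mentioned above, a call such as setPlayoutDelayHint(pc, 0.3) would request roughly 300 ms of extra buffering on each receiver that supports the attribute.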

yolostreet · 2019-10-21

I support this proposal. However, the term “latencyHint” is terribly overloaded in the media space. Latency can refer to the RTT latency of the client-server connection; it can also refer to the live stream latency from encoder to player display. These two are the more conventional interpretations. I would suggest changing the name of the attribute to avoid interpretation collisions. Possible improvements could be “startingBuffer”, or “playoutStartingBuffer” etc.

Cheers Will