Hi Yueshi, apologies for the delay. Nice to meet you as well. Great questions!
FYI, we just discussed this at TPAC and resolved to go back to a narrower latency hint, like I originally proposed. Safari also noted that either latency proposal is difficult for them to implement, given their limited control over the platform media player library.
There are 3 conflicting goals: a) minimum latency, b) smooth playback, and c) AV sync.
Latency and smoothness are definitely competing. Starting with lower latency increases the odds of underflow (decoded media not ready in time to be rendered).
AV sync is more nuanced - it may be affected, but this is not prescribed by any spec. Each UA has slightly different behavior to cope with underflow. In Chrome, the strategy below applies to underflows caused by slow decoders as well as by a slow network:
- Say you have audio+video in a single <video> tag. If just audio underflows, Chrome will immediately pause playback. AV sync is never lost, but the interruption is clearly noticeable. If just video underflows, Chrome will let audio keep playing for 3 seconds (breaking AV sync). Most often video catches back up (the user may not even notice), but after 3 seconds Chrome will pause both tracks to rebuffer.
- Say you instead have separate <audio> and <video> tags. Either tag can underflow without affecting the other. But here you have to manage AV sync entirely on your own (polling currentTime and making playbackRate adjustments to re-sync as needed - see the sketch near the end). I'll refer to this option as decoupled underflow in the text below.
Note again, the above is Chrome specific. Mozilla did express interest in standardizing underflow behavior across browsers, but that discussion is at a very early stage.
The 200ms buffer we are talking about here is the buffer of decoded video and audio frames (i.e., video in YUV or RGB format, audio in PCM format), right?
Correct. The 200ms number actually comes from Chrome's decoded audio frame buffer. The video frame buffer is ~3 frames, so video is generally the shorter of the two at common framerates (e.g., 3 frames at 30fps is only ~100ms).
If so, how does MSE control the maximum number of YUV/RGB frames decoded before being rendered, since uncompressed video frames can consume a lot of memory?
For now the size of the decoded frame buffer is up to the UA (each does it differently). MSE has no control.
With this proposal, setting renderBufferingHint = "none" would cause us to start playback as soon as we have a single frame of video and just enough audio to fill the platform playout buffer without glitching.
This proposal does not include a way to hint at a larger / custom buffering size. Perhaps it should. It may be that sites want something more than the bare minimum, but not as much as the default. Or perhaps they even want more than the default. I'll give this some thought - let me know if it interests you. In the end, it's still a hint - the UA would get final say, since some low values are impossible and some high values would require too much memory.
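To make that concrete, here is a sketch of how a custom-size variant might look. This is purely illustrative - the proposal today only covers the "none" hint, and the numeric form below is a hypothetical extension:

```js
const video = document.querySelector('video');

// Proposed form: start playback with the bare minimum of decoded media.
video.renderBufferingHint = 'none';

// Hypothetical extension (not in the proposal): hint a decoded-buffer
// target in seconds. The UA keeps final say - it could clamp values that
// are impossibly low or that would require too much memory.
video.renderBufferingHint = 0.5;
```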
For platforms with hardware decoding support, is the size of the DPB controlled by the hardware decoder?
No. The DPB is after the decoder.
DPB aside, some hw decoders may increase latency if they require several inputs to produce the first output. I’d have to double check how prevalent that is.
are the decoded frames stored in GPU memory?
Often yes, but this shouldn’t affect startup latency.
For applications like Cloud Gaming, the overall end-to-end latency budget is <150ms, how do they deal with the delay caused by hardware decoder’s DPB?
At present (having not shipped this proposal), it's pretty tricky and fairly UA specific. You can put Chrome's video renderer in low-delay mode by using MSE to append a video that has unknown duration in its container metadata. You can work around the audio renderer buffer by using WebAudio. You can also abandon MSE for WebRTC, but this requires significant changes for the site and media server.
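Here is a rough sketch of the MSE route (hypothetical URLs and codec string). The key is that the appended media itself is muxed with an unknown duration - that's a property of the bytes, not something you can set from script:

```js
const video = document.querySelector('video');
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', async () => {
  // Hypothetical codec string and segment URL.
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
  const init = await (await fetch('/live/init.mp4')).arrayBuffer();
  // The init segment's container metadata reports an unknown duration,
  // which triggers Chrome's low-delay video rendering heuristic.
  sb.appendBuffer(init);
});

// Audio would be decoded and scheduled via WebAudio (an AudioContext)
// instead of a media element, sidestepping its ~200ms decoded buffer.
```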
How can I map the new APIs you were proposing above to the user experience…
Let me reorder and lightly edit your questions to build up from the default case.
In order to achieve b) smooth playback for both audio and video and c) AV sync, but not a) minimum latency…
This is the default today without a hint. Simply combine audio+video into a single <video> tag. The UA will manage AV sync, and the default buffering size will optimize for smoothness instead of latency (assuming you're not triggering Chrome's existing low-delay mode heuristic).
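In sketch form (hypothetical source URL):

```js
// Default behavior: one element carries both tracks and no hint is set.
// The UA manages AV sync and buffers for smoothness over latency.
const video = document.createElement('video');
video.src = 'movie.mp4'; // hypothetical muxed audio+video source
document.body.appendChild(video);
video.play();
```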
In order to achieve a) minimum latency, but not b) smooth playback nor c) AV sync, how can I configure the video/audio hint?
Use separate <audio> and <video> tags. Set renderBufferingHint = "none" on both. No UA will attempt to manage AV sync for the separate tags.
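A sketch of that configuration, assuming the hint ships as proposed (hypothetical source URLs):

```js
// Minimum latency: separate elements, the hint on both, and no
// UA-managed AV sync between them.
const audio = document.createElement('audio');
const video = document.createElement('video');
audio.src = 'stream.audio.webm'; // hypothetical demuxed sources
video.src = 'stream.video.webm';
audio.renderBufferingHint = 'none'; // proposed hint, not yet shipped
video.renderBufferingHint = 'none';
document.body.append(audio, video);
audio.play();
video.play();
```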
As mentioned before, separating the tags will decouple underflow behavior. But this also means underflow can really wreck AV sync if you don’t make manual adjustments. See next question…
How can I achieve minimum latency (less smoothness) for audio and video and still keep AV sync?
The easy route is to combine audio+video into a single <video> tag and set renderBufferingHint = "none". Playback will start as soon as both tracks have a minimum amount of data decoded, and the UA will manage AV sync for you. BUT, the tracks will also have coupled underflow behavior.
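In sketch form:

```js
// Single element: minimum startup buffering and UA-managed AV sync,
// but coupled underflow behavior across the two tracks.
const video = document.querySelector('video'); // muxed audio+video
video.renderBufferingHint = 'none'; // proposed hint, not yet shipped
video.play();
```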
If you want decoupled underflow behavior, you would have to use separate tags and do AV sync manually, with initial (and periodic) playbackRate adjustments to catch up a track that falls behind.
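A rough sketch of that manual sync loop, treating audio as the master clock (the thresholds, rates, and polling interval are illustrative, not tuned values):

```js
// Manual AV sync across separate elements (decoupled underflow).
// Poll currentTime and nudge video's playbackRate toward the audio clock.
const audio = document.querySelector('audio');
const video = document.querySelector('video');

setInterval(() => {
  const drift = audio.currentTime - video.currentTime; // seconds
  if (Math.abs(drift) < 0.02) {
    video.playbackRate = 1.0; // close enough - play at normal speed
  } else if (drift > 0) {
    video.playbackRate = 1.1; // video lags audio - speed it up
  } else {
    video.playbackRate = 0.9; // video is ahead - let audio catch up
  }
}, 250);
```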
Can I achieve minimum latency for just video?
Yes. Again use separate tags, but only set the hint on video (the same shape as the sketch above). Use playbackRate adjustments if you want AV sync.