Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos
Yanhui Guo, Xi Zhang, Xiaolin Wu

TL;DR
This paper introduces a deep multi-modality neural network that uses video, audio, and the speaker's emotional state to restore very low bit-rate talking head videos, improving perceptual quality in bandwidth-limited scenarios.
Contribution
It presents a novel deep learning framework that exploits cross-modality correlations to restore video at very low bit rates. The framework works as a post-processor, so it remains compatible with existing video compression standards.
Findings
Significant perceptual quality improvement in compressed videos.
Effective removal of compression artifacts through multi-modality fusion.
Compatible with all existing video compression standards.
Abstract
We propose a novel deep multi-modality neural network for restoring very low bit-rate videos of talking heads. Such video content is very common in social media, teleconferencing, distance education, tele-medicine, etc., and often needs to be transmitted with limited bandwidth. The proposed CNN method exploits the correlations among three modalities (video, audio, and the emotional state of the speaker) to remove the video compression artifacts caused by spatial downsampling and quantization. The deep learning approach turns out to be ideally suited for the video restoration task, as the complex non-linear cross-modality correlations are very difficult to model analytically and explicitly. The new method is a video post-processor that can significantly boost the perceptual quality of aggressively compressed talking head videos, while being fully compatible with all existing video compression…
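The abstract attributes the artifacts to spatial downsampling and quantization. As a rough illustration of why those two steps lose information, the toy sketch below degrades a tiny grayscale frame and measures the resulting error. It is not the authors' codec or network: real codecs quantize transform coefficients rather than raw pixels, and every function name, the 4x4 frame, the factor of 2, and the step size of 16 are assumptions chosen for the example.

```python
# Toy illustration only (hypothetical names and values, not the paper's pipeline):
# spatial downsampling + uniform quantization, then naive upsampling.

def downsample(frame, factor):
    """Average-pool a 2D list of pixel values by `factor` in each dimension."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(0, h - h % factor, factor):
        row = []
        for x in range(0, w - w % factor, factor):
            block = [frame[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def quantize(frame, step):
    """Uniform scalar quantization: snap each value to the nearest multiple of `step`."""
    return [[step * round(v / step) for v in row] for row in frame]

def upsample(frame, factor):
    """Nearest-neighbour upsampling back towards the original resolution."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def mean_abs_error(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# A 4x4 frame with a horizontal ramp and one vertical step (assumed test data).
frame = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [50, 60, 70, 80],
]

# Degrade: downsample by 2, quantize with step 16, upsample by 2.
degraded = upsample(quantize(downsample(frame, 2), 16), 2)
error = mean_abs_error(frame, degraded)
print(degraded)
print(error)
```

The degraded frame keeps only one value per 2x2 block, so the horizontal ramp inside each block is flattened. A restoration network such as the one described here has to infer that lost detail from other cues, which is where the audio and emotion modalities are meant to help.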
