TL;DR
This paper presents a novel audio-visual upsampling network that transforms extremely low-resolution videos into high-quality, full-resolution talking-face videos, significantly improving super-resolution quality and video compression efficiency.
Contribution
The paper introduces an end-to-end multi-stage framework leveraging audio and image priors for extreme-scale talking-face video upsampling, achieving unprecedented scale and quality improvements.
Findings
Achieves 32x scaling from 8x8 to 256x256 resolution.
Improves FID score by 8x over previous super-resolution methods (lower FID is better).
Provides a 3.5x bit/pixel reduction in talking-face video compression.
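The 32x figure refers to the scale per side, so the number of pixels the network must synthesize grows much faster. A minimal sketch of that arithmetic follows; the variable names are illustrative and only the 8x8 and 256x256 resolutions come from the findings above.

```python
# Illustrative arithmetic only: resolutions are taken from the findings above,
# the variable names are assumptions made for this sketch.
low_res = 8      # input side length in pixels
high_res = 256   # output side length in pixels

# Upsampling factor along each side of the frame.
linear_scale = high_res // low_res

# Number of output pixels produced for every input pixel.
pixel_ratio = (high_res ** 2) // (low_res ** 2)

print(linear_scale)  # 32
print(pixel_ratio)   # 1024
```

In other words, each input pixel corresponds to a 32x32 patch of output pixels, which is why additional audio and image priors are needed to fill in the detail.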
Abstract
In this paper, we explore an interesting question of what can be obtained from an 8x8 pixel video sequence. Surprisingly, it turns out to be quite a lot. We show that when we process this video with the right set of audio and image priors, we can obtain a full-length, 256x256 video. We achieve this 32x scaling of an extremely low-resolution input using our novel audio-visual upsampling network. The audio prior helps to recover the elemental facial details and precise lip shapes, and a single high-resolution target identity image prior provides us with rich appearance details. Our approach is an end-to-end multi-stage framework. The first stage produces a coarse intermediate output video that can then be used to animate a single target identity image and generate realistic, accurate and high-quality outputs. Our approach is simple and performs exceedingly…
Taxonomy
Methods: Low-resolution input
