RADIO: Reference-Agnostic Dubbing Video Synthesis

Dongyeun Lee; Chaewon Kim; Sangjoon Yu; Jaejun Yoo; Gyeong-Moon Park

arXiv:2309.01950 · cs.CV · November 7, 2023

TL;DR

RADIO is a framework for high-fidelity, lip-synchronized dubbing video synthesis. It stays robust to the pose and expression of the reference image by modulating the decoder layers with a latent space built from audio and reference features, and it adds ViT blocks to the decoder to sharpen detail in the lip region.

Contribution

The paper introduces a reference-agnostic dubbing method that maintains high fidelity and synchronization regardless of the pose or expression in the reference image.

Findings

01. Outperforms state-of-the-art methods in challenging scenarios

02. Achieves high synchronization without fidelity loss

03. Robust to pose and expression variations

Abstract

One of the most challenging problems in audio-driven talking head generation is achieving high-fidelity detail while ensuring precise synchronization. Given only a single reference image, extracting meaningful identity attributes becomes even more challenging, often causing the network to mirror the facial and lip structures too closely. To address these issues, we introduce RADIO, a framework engineered to yield high-quality dubbed videos regardless of the pose or expression in reference images. The key is to modulate the decoder layers using a latent space composed of audio and reference features. Additionally, we incorporate ViT blocks into the decoder to emphasize high-fidelity details, especially in the lip region. Our experimental results demonstrate that RADIO displays high synchronization without the loss of fidelity. Especially in harsh scenarios where the reference frame…
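
As a rough illustration of what "modulating decoder layers with a latent composed of audio and reference features" can look like, the sketch below applies a feature-wise scale and shift (FiLM-style modulation) to a decoder feature map. This is a stand-in technique chosen for brevity and is not taken from the paper: the function name `modulate`, the projection matrices `w_scale` and `w_shift`, all tensor sizes, and the choice to build the latent by concatenation are assumptions made here for illustration only.

```python
import numpy as np

def modulate(features, latent, w_scale, w_shift):
    """FiLM-style modulation (illustrative assumption, not the paper's layer).

    features: decoder feature map of shape (C, H, W)
    latent:   1-D latent vector of shape (D,)
    w_scale, w_shift: hypothetical projection matrices of shape (D, C)
    """
    scale = latent @ w_scale  # per-channel scale offset, shape (C,)
    shift = latent @ w_shift  # per-channel shift, shape (C,)
    # Broadcast the per-channel terms over the spatial dimensions.
    return features * (1.0 + scale)[:, None, None] + shift[:, None, None]

rng = np.random.default_rng(0)

# Hypothetical audio and reference embeddings; sizes are arbitrary.
audio_feat = rng.normal(size=8)
ref_feat = rng.normal(size=8)

# Assumption: the latent is formed by simple concatenation.
latent = np.concatenate([audio_feat, ref_feat])  # shape (16,)

C, H, W = 4, 6, 6
features = rng.normal(size=(C, H, W))
w_scale = rng.normal(size=(latent.size, C)) * 0.1
w_shift = rng.normal(size=(latent.size, C)) * 0.1

out = modulate(features, latent, w_scale, w_shift)
print(out.shape)
```

In this sketch the reference image influences the output only through the latent vector, not through spatially aligned features, which is one plausible way a decoder could avoid copying the reference pose too closely. Whether RADIO works this way in detail should be checked against the paper itself.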

Videos

RADIO: Reference-Agnostic Dubbing Video Synthesis (YouTube)

Taxonomy

Topics: Speech and Audio Processing · Face Recognition and Analysis · Generative Adversarial Networks and Image Synthesis