RADIO: Reference-Agnostic Dubbing Video Synthesis
Dongyeun Lee, Chaewon Kim, Sangjoon Yu, Jaejun Yoo, Gyeong-Moon Park

TL;DR
RADIO is a framework for high-fidelity, lip-synchronized dubbing video synthesis that stays robust to the pose and expression of the reference image. It modulates the decoder layers with a latent space built from audio and reference features, and adds ViT blocks to the decoder to sharpen detail in the lip region.
Contribution
It introduces a reference-agnostic dubbing method that maintains high fidelity and synchronization even when the reference image deviates significantly in pose or expression.
Findings
Outperforms state-of-the-art methods in challenging scenarios
Achieves high synchronization without fidelity loss
Robust to pose and expression variations
Abstract
One of the most challenging problems in audio-driven talking head generation is achieving high-fidelity detail while ensuring precise synchronization. Given only a single reference image, extracting meaningful identity attributes becomes even more challenging, often causing the network to mirror the facial and lip structures too closely. To address these issues, we introduce RADIO, a framework engineered to yield high-quality dubbed videos regardless of the pose or expression in reference images. The key is to modulate the decoder layers using latent space composed of audio and reference features. Additionally, we incorporate ViT blocks into the decoder to emphasize high-fidelity details, especially in the lip region. Our experimental results demonstrate that RADIO displays high synchronization without the loss of fidelity. Especially in harsh scenarios where the reference frame…
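The abstract's central idea, modulating decoder layers with a latent built from audio and reference features, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the modulation is done, so the StyleGAN2-style weight modulation and demodulation used here is an assumption, and every name (`make_style`, `modulated_linear`, `proj`, `bias`) and every dimension is hypothetical.

```python
import numpy as np

def make_style(audio_feat, ref_feat, proj, bias):
    """Map concatenated audio and reference features to per-channel scales.

    Hypothetical helper: a single affine map stands in for whatever
    mapping network the real model uses.
    """
    latent = np.concatenate([audio_feat, ref_feat])  # joint latent vector
    return proj @ latent + bias                      # shape: (c_in,)

def modulated_linear(x, weight, style, eps=1e-8):
    """Linear layer whose weights are scaled by a style vector.

    Assumed StyleGAN2-style scheme: scale each input channel of the
    weight by the style, then renormalize every output row to unit norm.
    x: (n, c_in), weight: (c_out, c_in), style: (c_in,)
    """
    w = weight * style[None, :]                                 # modulate
    demod = 1.0 / np.sqrt((w ** 2).sum(axis=1, keepdims=True) + eps)
    w = w * demod                                               # demodulate
    return x @ w.T                                              # (n, c_out)

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
audio_feat = rng.normal(size=16)   # e.g. an audio window embedding
ref_feat = rng.normal(size=32)     # e.g. a reference image embedding
proj = rng.normal(size=(8, 48))
bias = np.ones(8)

style = make_style(audio_feat, ref_feat, proj, bias)
features = rng.normal(size=(5, 8))     # 5 spatial positions, 8 channels
weight = rng.normal(size=(4, 8))
out = modulated_linear(features, weight, style)
```

In this sketch the reference image influences the decoder only through a global style vector rather than through spatially aligned features. That is one plausible reading of how a model could avoid copying the reference's lip shape too closely, but it is an interpretation and not something the abstract confirms.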
Videos
RADIO: Reference-Agnostic Dubbing Video Synthesis · YouTube
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
