Seeing Voices: Generating A-Roll Video from Audio with Mirage

Aditi Sundararaman; Amogh Adishesha; Andrew Jaegle; Dan Bigioi; Hyoung-Kyu Song; Jon Kyl; Justin Mao; Kevin Lan; Mojtaba Komeili; ShahRukh Athar; Sheila Babayan; Stanislau Beliasau; William Buchwalter

arXiv:2506.08279 · cs.CV · June 11, 2025

Open Access

TL;DR

Mirage is a novel foundation model that generates realistic, expressive video of talking people directly from audio, integrating audio and visual modalities for high-quality multimodal content creation.

Contribution

The paper introduces Mirage, a unified self-attention-based framework for audio-to-video generation that outperforms existing methods in quality and flexibility.

Findings

1. Generated videos are highly realistic and expressive.
2. Mirage outperforms previous methods in subjective quality.
3. The approach is versatile for various audio-to-video tasks.

Abstract

From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video's audio track) with what we see (the video's image sequence). Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing methods for speech synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis

Methods: Focus