TL;DR
This paper introduces a novel audio-visual framework for cinematic audio source separation that leverages visual cues and synthetic training data to improve separation quality in films.
Contribution
The paper presents the first framework for audio-visual cinematic audio source separation (AV-CASS). It is built on conditional flow matching and comes with a training data synthesis pipeline and a dedicated visual encoder; the model is trained entirely on synthetic data.
Findings
The model generalizes well to real-world cinematic content.
It achieves strong performance on synthetic, real-world, and audio-only benchmarks.
It uses visual context to enhance audio source separation quality.
Abstract
Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic…
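The abstract names conditional flow matching but does not spell out the training objective. As a rough illustration only, the sketch below builds one training example for the commonly used linear (optimal-transport) probability path and computes a mean-squared-error loss against the target velocity. The path choice, the `sigma_min` value, and the names `cfm_training_pair`, `cfm_loss`, `predict_velocity`, and `cond` are all assumptions made for this example; the paper's actual network, conditioning scheme, and audio representation are not described in the text above.

```python
import numpy as np


def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Build one conditional flow matching example on the linear path.

    x1 is a batch of target samples (for instance, features of a clean
    stem), shape (batch, ...). Returns the sampled time t, the point x_t
    on the path from noise to x1, and the target velocity u_t.
    """
    x0 = rng.standard_normal(x1.shape)  # noise endpoint of the path
    # One time value per batch item, broadcastable against x1.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0  # time derivative of x_t
    return t, x_t, u_t


def cfm_loss(predict_velocity, x1, cond, rng, sigma_min=1e-4):
    """Mean squared error between predicted and target velocity.

    predict_velocity is a hypothetical model callable (x_t, t, cond) -> v;
    cond stands in for whatever conditioning is used (e.g. the mixture
    and visual features). Both are placeholders, not the paper's API.
    """
    t, x_t, u_t = cfm_training_pair(x1, rng, sigma_min)
    v = predict_velocity(x_t, t, cond)
    return float(np.mean((v - u_t) ** 2))
```

At inference time, a model trained this way would typically be integrated from noise at t = 0 to a sample at t = 1 with an ODE solver, conditioned on the same inputs; that step is omitted here.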
