Cinematic Audio Source Separation Using Visual Cues

Kang Zhang; Suyeon Lee; Arda Senocak; Joon Son Chung

arXiv:2603.26113 · cs.MM · March 30, 2026

TL;DR

This paper introduces a novel audio-visual framework for cinematic audio source separation that leverages visual cues and synthetic training data to improve separation quality in films.

Contribution

The paper presents the first AV-CASS framework, formulated with conditional flow matching, together with a training data synthesis pipeline and a dedicated visual encoder. The model is trained entirely on synthetic data.
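
To make the idea of synthetic training data concrete, the audio side of such a pipeline can be sketched as mixing three isolated stems into one track while keeping the stems as separation targets. This is only an illustration under stated assumptions: the function name `synthesize_mixture`, the random gain range, and the peak normalization are hypothetical choices and are not taken from the paper, and the pairing of each stem with a video stream is omitted.

```python
import numpy as np

def synthesize_mixture(speech, music, effects, rng, gain_db_range=(-10.0, 0.0)):
    """Mix three equal-length mono stems into one synthetic training example.

    Hypothetical sketch: the gain range and peak normalization are
    assumptions for illustration, not the paper's settings.
    """
    stems = {}
    for name, wav in (("speech", speech), ("music", music), ("effects", effects)):
        # Draw a random gain in decibels and convert it to a linear factor.
        gain_db = rng.uniform(*gain_db_range)
        stems[name] = wav * 10.0 ** (gain_db / 20.0)

    # The mixture is the sample-wise sum of the scaled stems.
    mixture = stems["speech"] + stems["music"] + stems["effects"]

    # If the sum clips, scale mixture and stems together so that the
    # stems still add up exactly to the mixture.
    peak = float(np.max(np.abs(mixture)))
    if peak > 1.0:
        mixture = mixture / peak
        stems = {name: wav / peak for name, wav in stems.items()}
    return mixture, stems
```

Scaling the stems together with the mixture keeps the targets consistent with the input, which is what a supervised separation objective would need.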

Findings

01 The model generalizes well to real-world cinematic content.

02 It achieves strong performance on synthetic, real-world, and audio-only benchmarks.

03 It uses visual context to enhance audio source separation quality.

Abstract

Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic…
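
As a rough illustration of what a conditional flow matching objective looks like, the sketch below builds one training example and the regression loss. It assumes the linear (optimal transport) probability path from the original flow matching formulation; the abstract does not say which path, network, audio representation, or conditioning scheme the authors use, so the names `cfm_training_pair`, `cfm_loss`, `predict_velocity`, and `cond` are hypothetical.

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Sample (t, x_t, u_t) for one conditional flow matching step.

    x1: clean target stems, e.g. shape (batch, stems, samples).
    Assumes the linear (optimal transport) path; this choice is an
    assumption, not something stated in the abstract.
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample at t = 0
    # One time value per batch item, broadcastable against x1.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    # Point on the path between noise (t = 0) and data (t = 1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Target velocity: the time derivative of x_t along that path.
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t

def cfm_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity(x_t, t, cond) stands in for a network that sees
    the conditioning (for AV-CASS, presumably the mixture audio and
    visual features; hypothetical here).
    """
    t, x_t, u_t = cfm_training_pair(x1, rng)
    v = predict_velocity(x_t, t, cond)
    return float(np.mean((v - u_t) ** 2))
```

At inference time, a model trained this way would start from noise and integrate the predicted velocity from t = 0 to t = 1, conditioned on the mixture and the video, to produce the separated stems.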

Code & Models

Repositories

https://cass-flowmatching.github.io