Conditional Flow Matching for Visually-Guided Acoustic Highlighting

Hugo Malard; Gael Le Lan; Daniel Wong; David Lou Alon; Yi-Chiao Wu; Sanjeel Parekh

arXiv:2602.03762 · eess.AS · February 5, 2026

TL;DR

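This paper reframes visually-guided acoustic highlighting as a generative problem and introduces a Conditional Flow Matching (CFM) framework for it. The generative formulation targets the ambiguity of audio remixing, where no one-to-one mapping exists between poorly balanced and well-balanced mixes, and it improves over previous discriminative methods.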

Contribution

The paper proposes a generative CFM approach to acoustic highlighting that combines a rollout loss, which counters the compounding of early prediction errors across generation steps, with cross-modal conditioning for audio-visual alignment.

Findings

01. Outperforms previous discriminative methods in accuracy

02. Stabilizes long-range flow trajectories with rollout loss

03. Effectively fuses audio and visual cues for source selection

Abstract

Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss…
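The general idea of flow matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a straight-line probability path between a starting sample `x0` and a target sample `x1` (one common choice in the flow matching literature), and the names `velocity_fn`, `cond`, `cfm_loss`, and `euler_sample` are hypothetical. What the paper uses as the starting sample, the conditioning signal, and the rollout loss itself are not specified in the text above, so they are left abstract here.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def cfm_loss(velocity_fn, x0, x1, cond, t):
    """Mean squared error between predicted and target velocity at time t.

    velocity_fn(x_t, t, cond) is a hypothetical model; cond stands in for
    whatever conditioning (e.g. visual features) the model receives.
    """
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, cond)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, cond, steps):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, cond)
        # Each step starts from the previous step's output, so an error
        # in an early velocity prediction carries into all later steps.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

The sampler makes the issue raised in the abstract concrete: the training loss above scores each time step in isolation, while generation chains the steps, so early mistakes are never corrected unless training also accounts for multi-step behavior.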

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis