Learning to Highlight Audio by Watching Movies

Chao Huang; Ruohan Gao; J. M. F. Tsang; Jan Kurcius; Cagdas Bilen; Chenliang Xu; Anurag Kumar; Sanjeel Parekh

arXiv:2505.12154 · cs.CV · May 20, 2025

TL;DR

This paper introduces a transformer-based multimodal framework for visually-guided acoustic highlighting, using a new movie-based dataset and pseudo-data generation to improve audio-visual harmony in content creation.

Contribution

The paper introduces the task of visually-guided acoustic highlighting, together with a new dataset and a pseudo-data generation process for training and evaluation.

Findings

01 The proposed model outperforms baselines on quantitative metrics.

02 Subjective evaluations favor the proposed approach for audio-visual harmony.

03 The type of contextual guidance affects highlighting effectiveness.

Abstract

Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music and Audio Processing