CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation

Yuanhong Chen; Kazuki Shimada; Christian Simon; Yukara Ikemiya; Takashi Shibuya; Yuki Mitsufuji

arXiv:2501.02786 · cs.SD · August 7, 2025

TL;DR

This paper introduces CCStereo, a novel audio-visual model for binaural audio generation that uses contrastive learning and dynamic normalization to improve spatial accuracy and reduce overfitting to room environments.

Contribution

We propose a new model with an audio-visual conditional normalization layer and contrastive learning to enhance spatial sensitivity in binaural audio generation.
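
As a rough illustration of what a conditional normalisation layer of this kind might look like, the sketch below normalises each audio feature channel and then re-scales and shifts it with values predicted from a visual embedding. It is written in plain Python to stay dependency-free. The function name `conditional_norm`, the linear heads `w_gamma` and `w_beta`, and the feature shapes are all assumptions made for illustration; the paper's actual layer is not specified on this page and may differ.

```python
import math


def conditional_norm(audio_feat, visual_ctx, w_gamma, w_beta, eps=1e-5):
    """Normalise each audio channel, then re-scale and shift it using
    values predicted from the visual context (an AdaIN/FiLM-style sketch).

    audio_feat: list of C channels, each a list of T floats
    visual_ctx: list of D floats (a pooled visual embedding, hypothetical)
    w_gamma, w_beta: C x D weight matrices (hypothetical linear heads)
    """
    out = []
    for c, channel in enumerate(audio_feat):
        # Per-channel statistics of the audio features.
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        # The visual context predicts the target scale and shift.
        # The "1.0 +" keeps the layer close to identity for small weights.
        gamma = 1.0 + sum(w * v for w, v in zip(w_gamma[c], visual_ctx))
        beta = sum(w * v for w, v in zip(w_beta[c], visual_ctx))
        out.append([gamma * (x - mean) / std + beta for x in channel])
    return out


# Usage: one audio channel, a 2-dimensional visual embedding.
features = conditional_norm(
    audio_feat=[[1.0, 2.0, 3.0, 4.0]],
    visual_ctx=[1.0, 0.0],
    w_gamma=[[1.0, 0.0]],
    w_beta=[[0.5, 0.0]],
)
```

In this toy call the visual embedding sets the output channel to a mean of 0.5 and a standard deviation of about 2, regardless of the statistics of the input audio features.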

Findings

01 Achieves state-of-the-art accuracy on FAIR-Play and MUSIC-Stereo benchmarks.

02 Effectively utilizes test-time augmentation for improved performance.

03 Enhances spatial detail preservation in binaural audio synthesis.

Abstract

Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
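
The idea of mining negatives from shuffled visual features can be sketched with an InfoNCE-style loss, shown below in plain Python. This is a minimal illustration under stated assumptions, not the paper's loss: the region-level visual features, the concatenation used to make the embedding position-sensitive, the cosine similarity, the temperature, and the helper names `shuffled_negatives` and `contrastive_loss` are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)


def flatten(regions):
    """Concatenate region features in spatial order (position-sensitive)."""
    return [x for region in regions for x in region]


def shuffled_negatives(regions, num_neg, rng, max_tries=100):
    """Build negatives by permuting the spatial order of region features."""
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == num_neg:
            break
        perm = regions[:]
        rng.shuffle(perm)
        if perm != regions:  # skip permutations that leave the layout unchanged
            negatives.append(perm)
    return negatives


def contrastive_loss(audio_emb, regions, num_neg=4, temperature=0.1, seed=0):
    """InfoNCE-style loss: the correctly ordered visual features are the
    positive; spatially shuffled copies are the negatives."""
    rng = random.Random(seed)
    pos = cosine(audio_emb, flatten(regions)) / temperature
    negs = [cosine(audio_emb, flatten(n)) / temperature
            for n in shuffled_negatives(regions, num_neg, rng)]
    logits = [pos] + negs
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - pos


# Usage: three visual regions and an audio embedding that matches their layout.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
loss = contrastive_loss(flatten(regions), regions)
```

The loss is low when the audio embedding agrees with the true spatial layout and rises when it agrees better with a shuffled layout, which is the kind of pressure toward spatial sensitivity the abstract describes.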

Taxonomy

Topics: Music and Audio Processing · Hearing Loss and Rehabilitation

Methods: Contrastive Learning