CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji

TL;DR
This paper introduces CCStereo, a novel audio-visual model for binaural audio generation that uses contrastive learning and dynamic normalization to improve spatial accuracy and reduce overfitting to room environments.
Contribution
We propose a new model with an audio-visual conditional normalization layer and contrastive learning to enhance spatial sensitivity in binaural audio generation.
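To make the idea of a visually conditioned normalisation layer concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes an AdaIN-style formulation in which each audio feature channel is standardised and then rescaled and shifted by values projected from a visual feature vector. The names `conditional_normalise`, `scale_weights`, and `shift_weights` are hypothetical, and the weights stand in for learned linear projections.

```python
import math
from statistics import fmean, pstdev


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def conditional_normalise(audio_channels, visual_feature,
                          scale_weights, shift_weights, eps=1e-5):
    """Standardise each audio channel, then re-scale and shift it using
    statistics predicted from the visual feature.

    Assumption: gamma and beta come from simple linear projections of the
    visual feature (one weight vector per channel); a real model would
    learn these projections.
    """
    normalised = []
    for channel, w_scale, w_shift in zip(audio_channels, scale_weights, shift_weights):
        mean = fmean(channel)
        std = math.sqrt(pstdev(channel) ** 2 + eps)
        gamma = 1.0 + dot(w_scale, visual_feature)  # visually predicted scale
        beta = dot(w_shift, visual_feature)         # visually predicted shift
        normalised.append([gamma * (x - mean) / std + beta for x in channel])
    return normalised


# Toy example: two audio channels and a 2-dimensional visual feature.
audio = [[1.0, 2.0, 3.0], [0.0, 4.0, 8.0]]
visual = [1.0, 3.0]
out = conditional_normalise(
    audio, visual,
    scale_weights=[[1.0, 0.0], [0.0, 0.0]],
    shift_weights=[[0.0, 1.0], [0.0, 0.0]],
)
```

In this toy setup the first channel ends up with a mean and spread dictated by the visual feature, while the second channel, whose projection weights are zero, is simply standardised.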
Findings
Achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
Introduces a cost-efficient way to apply test-time augmentation to video data, which further improves performance.
Enhances spatial detail preservation in binaural audio synthesis.
Abstract
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
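The contrastive component described above can be illustrated with a small sketch. It assumes an InfoNCE-style loss in which the correctly ordered visual feature map is the positive and copies with their spatial locations shuffled serve as negatives. The abstract does not specify the loss or the shuffling scheme, so the functions `shuffled_negatives` and `info_nce`, the flattened feature layout, and the temperature value are all hypothetical choices for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def flatten(feature_map):
    """Concatenate per-location vectors so spatial order is preserved."""
    return [value for location in feature_map for value in location]


def shuffled_negatives(feature_map, num_negatives, rng):
    """Build negatives by permuting the spatial locations of the map.

    Assumption: shuffling location order destroys the spatial layout
    while keeping the same visual content.
    """
    negatives = []
    for _ in range(num_negatives):
        permuted = list(feature_map)
        rng.shuffle(permuted)
        negatives.append(flatten(permuted))
    return negatives


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-probability of picking the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy example: a visual map with three locations, two values per location.
rng = random.Random(0)
visual_map = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
negatives = shuffled_negatives(visual_map, num_negatives=3, rng=rng)
anchor = flatten(visual_map)  # stand-in for an audio-side embedding
loss = info_nce(anchor, flatten(visual_map), negatives)
```

The loss is low when the anchor is closer to the correctly ordered visual features than to the shuffled ones, which is one way to reward sensitivity to spatial layout.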
Taxonomy
Topics: Music and Audio Processing · Hearing Loss and Rehabilitation
Methods: Contrastive Learning
