Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

Yan-Bo Lin; Yu-Chiang Frank Wang

arXiv:2105.00708·cs.SD·May 4, 2021

TL;DR

This paper introduces an audio spatialization framework that converts a video's monaural audio into binaural audio by exploiting left-right audio-visual consistency, a self-supervised signal that reduces the need for ground-truth binaural recordings during training.

Contribution

It proposes a semi-supervised learning approach that exploits audio-visual relationships to generate spatial audio from monaural recordings, reducing reliance on large amounts of video with ground-truth binaural audio.

Findings

01 Effective in semi-supervised and fully supervised settings

02 Improves spatial audio quality in benchmark tests

03 Visualization confirms accurate audio-visual consistency

Abstract

Humans perceive a rich auditory experience because each ear hears a distinct sound. Videos recorded with binaural audio, in particular, simulate how humans receive ambient sound. However, a large number of videos have only monaural audio, which degrades the user experience due to the lack of ambient information. To address this issue, we propose an audio spatialization framework that converts a monaural video into a binaural one by exploiting the relationship between its audio and visual components. Because it preserves left-right consistency in both the audio and visual modalities, our learning strategy can be viewed as a self-supervised learning technique, and it alleviates the dependency on a large amount of video data with ground-truth binaural audio during training. Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised…
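
The left-right consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy intensity-based panning rule, and the L1 form of the loss are all assumptions made for illustration. The sketch only shows the general principle that mirroring the video frame horizontally should swap the predicted left and right channels, which yields a training signal that needs no ground-truth binaural audio.

```python
def flip_horizontal(frame):
    """Mirror a frame (a list of pixel rows) left to right."""
    return [row[::-1] for row in frame]


def toy_spatializer(mono, frame):
    """Hypothetical stand-in for a learned model (assumption, not from the paper).

    Pans the mono signal toward the side of the frame holding more intensity.
    """
    width = len(frame[0])
    total = sum(sum(row) for row in frame)
    if total == 0:
        pan = 0.5  # no visual cue: split the signal evenly
    else:
        # Intensity-weighted horizontal position in [0, 1]; 0 = left, 1 = right.
        weighted = sum(v * x for row in frame for x, v in enumerate(row))
        pan = weighted / total / max(width - 1, 1)
    left = [(1.0 - pan) * s for s in mono]
    right = [pan * s for s in mono]
    return left, right


def left_right_consistency_loss(predict, mono, frame):
    """Mean absolute mismatch between the prediction on the original frame
    and the channel-swapped prediction on the mirrored frame."""
    left, right = predict(mono, frame)
    left_f, right_f = predict(mono, flip_horizontal(frame))
    err = sum(
        abs(l - rf) + abs(r - lf)
        for l, r, lf, rf in zip(left, right, left_f, right_f)
    )
    return err / (2 * len(mono))


if __name__ == "__main__":
    mono = [1.0, -0.5, 0.25]
    frame = [[0.0, 0.0, 1.0], [0.0, 0.2, 0.8]]  # intensity mostly on the right
    print(left_right_consistency_loss(toy_spatializer, mono, frame))
```

A predictor that respects the mirror symmetry incurs a loss near zero, while one that ignores the frame (for example, always routing the signal to the left channel) is penalized. In a semi-supervised setting, a term of this kind could be computed on videos without binaural labels alongside a supervised loss on the labeled ones.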

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation