Binaural Audio Generation via Multi-task Learning

Sijia Li; Shiguang Liu; Dinesh Manocha

arXiv:2109.00748 · cs.SD · September 3, 2021

TL;DR

This paper introduces a multi-task learning approach that generates binaural audio from mono audio, using visual features from video and an auxiliary flipped audio classification task to improve spatial audio synthesis quality.

Contribution

A multi-task learning framework that jointly performs binaural audio generation and flipped audio classification using visual features, with the aim of improving spatialization accuracy.
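The flipped audio classification task can be illustrated with a small sketch of how a training example might be built. The helper name `make_flip_example`, the 50% flip probability, and the plain-list sample representation are assumptions made for illustration; the summary above does not say how flipped examples are constructed.

```python
import random

def make_flip_example(left, right, rng, flip_prob=0.5):
    """Return (left, right, label), where label is 1 if the channels were swapped.

    Hypothetical helper: the flip probability and data layout are assumptions.
    """
    if rng.random() < flip_prob:
        # Swap the channels so the audio no longer matches the visual layout.
        return right, left, 1
    return left, right, 0

rng = random.Random(0)
left = [0.1, 0.2, 0.3]     # toy left-channel samples
right = [0.0, -0.1, -0.2]  # toy right-channel samples
l, r, label = make_flip_example(left, right, rng)
```

A classifier trained on such pairs has to decide from the audio and the visual features whether `label` is 0 or 1, which requires it to relate the position of sound sources in the frame to the channel they dominate.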

Findings

01. Outperforms prior techniques on quantitative metrics

02. Demonstrates improved spatialization in qualitative evaluations

03. Makes effective use of visual features from videos

Abstract

We present a learning-based approach for generating binaural audio from mono audio using multi-task learning. Our formulation leverages additional information from two related tasks: the binaural audio generation task and the flipped audio classification task. Our learning model extracts spatialization features from the visual and audio input, predicts the left and right audio channels, and judges whether the left and right channels are flipped. First, we extract visual features using ResNet from the video frames. Next, we perform binaural audio generation and flipped audio classification using separate subnetworks based on visual features. Our learning method optimizes the overall loss based on the weighted sum of the losses of the two tasks. We train and evaluate our model on the FAIR-Play dataset and the YouTube-ASMR dataset. We perform quantitative and qualitative evaluations to…
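The weighted-sum objective described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mean squared error for the generation task, the binary cross-entropy for the flip classifier, and the weights `w_gen` and `w_cls` are all assumptions, since the abstract only states that the overall loss is a weighted sum of the two task losses.

```python
import math

def mse_loss(pred, target):
    """Mean squared error between predicted and target samples (assumed generation loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy for the flipped/not-flipped prediction (assumed classification loss)."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def total_loss(gen_loss, cls_loss, w_gen=1.0, w_cls=0.1):
    """Weighted sum of the two task losses; the weight values here are hypothetical."""
    return w_gen * gen_loss + w_cls * cls_loss

gen = mse_loss([0.5, 0.0], [0.5, 0.2])  # toy generation error
cls = bce_loss(0.5, 1)                  # classifier is maximally unsure
loss = total_loss(gen, cls)
```

In this sketch, lowering `w_cls` makes the classification task act as a lighter regularizer on the shared features, while raising it pushes the model to prioritize left/right discrimination.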

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection