Visual-based spatial audio generation system for multi-speaker environments

Xiaojing Liu, Ogulcan Gurelli, Yan Wang, and Joshua Reiss

arXiv:2502.07538·cs.MM·February 14, 2025

Open Access

TL;DR

This paper introduces an automated visual-based system for generating spatial audio in multi-speaker environments, improving audio-visual alignment and speech quality without additional dataset training.

Contribution

The system integrates face detection, depth estimation, and spatial audio techniques to automate and enhance spatial audio generation in multimedia applications.
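As a rough illustration of how a detected face position and an estimated depth could drive spatial audio cues, the sketch below maps a face bounding box to an azimuth and then derives simple interaural cues. It is not the paper's method: the function names, the 60-degree field of view, the pinhole camera model, the Woodworth time-difference formula, and the constant-power panning are all assumptions chosen for illustration, and a real system would more likely use measured HRTFs.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature
HEAD_RADIUS = 0.0875    # m; a typical average head radius (assumption)


def bbox_to_azimuth(x_min, x_max, frame_width, horizontal_fov_deg=60.0):
    """Map the horizontal centre of a face box to an azimuth in radians.

    Assumes a pinhole camera with a known horizontal field of view
    (hypothetical default of 60 degrees). Negative is left, positive is right.
    """
    centre_x = 0.5 * (x_min + x_max)
    half_width = frame_width / 2.0
    offset = (centre_x - half_width) / half_width  # normalised to [-1, 1]
    half_fov = math.radians(horizontal_fov_deg) / 2.0
    return math.atan(offset * math.tan(half_fov))


def binaural_cues(azimuth, depth_m):
    """Return (itd_seconds, left_gain, right_gain) for one speaker.

    ITD uses the Woodworth spherical-head approximation; the gains use
    constant-power panning scaled by a 1/distance attenuation that is
    clamped at 1 m. Both are illustrative stand-ins for HRTF filtering.
    """
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (azimuth + math.sin(azimuth))
    pan = (azimuth + math.pi / 2.0) / 2.0  # [-90, 90] deg -> [0, 90] deg
    attenuation = 1.0 / max(depth_m, 1.0)
    return itd, math.cos(pan) * attenuation, math.sin(pan) * attenuation


if __name__ == "__main__":
    # A face near the right edge of a 1920-pixel-wide frame, 2 m away.
    az = bbox_to_azimuth(1800, 1920, 1920)
    itd, left, right = binaural_cues(az, depth_m=2.0)
    print(f"azimuth={math.degrees(az):.1f} deg, ITD={itd * 1e6:.0f} us, "
          f"L={left:.3f}, R={right:.3f}")
```

In this sketch a positive time difference means the sound reaches the right ear first, so the left channel would be delayed by that amount before mixing.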

Findings

01 Significantly improves spatial consistency between audio and video

02 Enhances speech quality in multi-speaker scenarios

03 Operates without additional binaural dataset training

Abstract

In multimedia applications such as films and video games, spatial audio techniques are widely employed to enhance user experiences by simulating 3D sound, transforming mono audio into binaural formats. However, this process is often complex and labor-intensive for sound designers, requiring precise synchronization of audio with the spatial positions of visual components. To address these challenges, we propose a visual-based spatial audio generation system: an automated system that integrates face detection (using YOLOv8 for object detection), monocular depth estimation, and spatial audio techniques. Notably, the system operates without requiring additional binaural dataset training. The proposed system is evaluated against existing spatial audio generation systems using objective metrics. Experimental results demonstrate that our method significantly improves spatial consistency between audio…

Taxonomy

Topics: Music and Audio Processing