SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation

Kazuki Shimada; Christian Simon; Takashi Shibuya; Shusuke Takahashi; Yuki Mitsufuji

arXiv:2412.13462 · cs.SD · February 5, 2026

TL;DR

This paper introduces a new benchmark for spatially aligned audio-video generation, including a dataset, an alignment metric, and baseline evaluations, to advance the development of immersive multimodal video synthesis.

Contribution

The paper establishes the first benchmark for spatially aligned audio-video generation, comprising a dedicated dataset, a novel alignment metric, and evaluations of baseline methods.

Findings

01. Baseline methods show significant gaps in quality and alignment compared to ground truth.

02. The proposed metric effectively evaluates spatial alignment between audio and video.

03. Benchmarking reveals challenges in achieving high-quality, well-aligned audio-visual synthesis.

Abstract

This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking the Spatially Aligned Audio-Video Generation (SAVG) task. We introduce a spatially aligned audio-visual dataset, whose audio and video data are curated based on whether sound events are onscreen or not. We also propose a new alignment metric that aims to evaluate the spatial alignment between audio and video. Then, using the dataset and metric, we benchmark two types of baseline methods: one is based on a joint audio-video generation model, and the other is a two-stage method that…
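
The abstract does not spell out how the alignment metric is computed, so the following is only a minimal sketch of the general idea of checking whether localized sounds coincide with onscreen objects. It assumes, purely for illustration, that an object detector has already produced horizontal box extents per video frame and that an audio localizer has produced horizontal sound positions per frame, both normalized to [0, 1]. The function name `spatial_alignment_score`, the `Box` type, and the `tolerance` parameter are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: horizontal extent of a detected object,
# as (x_min, x_max) normalized to the frame width.
Box = Tuple[float, float]


def spatial_alignment_score(
    boxes_per_frame: List[List[Box]],
    sound_x_per_frame: List[List[float]],
    tolerance: float = 0.0,
) -> float:
    """Fraction of localized sound positions that fall inside a detected box.

    This is an illustrative score under the assumptions stated above,
    not the metric defined in the paper.
    """
    if len(boxes_per_frame) != len(sound_x_per_frame):
        raise ValueError("inputs must cover the same number of frames")

    matched = 0
    total = 0
    for boxes, sound_xs in zip(boxes_per_frame, sound_x_per_frame):
        for x in sound_xs:
            total += 1
            # A sound counts as aligned if it lies within any box in the
            # same frame, optionally widened by `tolerance` on both sides.
            if any(lo - tolerance <= x <= hi + tolerance for lo, hi in boxes):
                matched += 1

    # With no localized sounds at all, report 0.0 rather than dividing by zero.
    return matched / total if total else 0.0


if __name__ == "__main__":
    boxes = [[(0.1, 0.3)], [(0.6, 0.9)], []]
    sounds = [[0.2], [0.2], [0.5]]
    print(spatial_alignment_score(boxes, sounds))
```

In this toy input only the first frame has a sound inside a detected box. The second frame's sound lies far from its box, and the third frame has a sound with no onscreen object at all, which is the offscreen case the dataset curation is meant to exclude.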

Code & Models

Models

Video-Bench/Video-Bench (Hugging Face model; 1 download, 1 like)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing