PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie; Wentao Lei; Kai Jiang; Guanjie Huang; Pengfei Zhang; Chunhui Zhang; Fengji Ma; Haoyu He; Han Zhang; Jiangshan He; Jinting Wang; Linghan Fang; Lufei Gao; Orkesh Ablet; Peihua Zhang; Ruolin Hu; Shengyu Li; Weilin Lin; Xiaoyang Feng; Xinyue Yang; Yan Rong; Yanyun Wang; Zihang Shao; Zelin Zhao; Chenxing Li; Shan Yang; Wenfu Wang; Meng Yu; Dong Yu; Li Liu

arXiv:2512.23994·cs.SD·May 19, 2026


TL;DR

PhyAVBench is a new benchmark and dataset designed to evaluate and improve the physical plausibility of audio-visual generation models, revealing current limitations in modeling real-world physics.

Contribution

It introduces PhyAVBench, the first benchmark focused on audio-physics grounding, along with a new dataset, evaluation paradigm, and metric for physically grounded T2AV, I2AV, and V2A models.

Findings

01 Leading models struggle with fundamental physics phenomena.

02 Current models excel at synchronization but lack physical realism.

03 PhyAVBench exposes critical gaps in physically grounded audio-visual generation.

Abstract

Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks focus primarily on audio-video temporal synchronization and largely overlook explicit evaluation of audio-physics grounding, which limits the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 11,605 audible videos totaling 25.5 hours, collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with…
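To give a sense of scale, the dataset figures quoted in the abstract can be turned into per-clip and per-participant averages. The sketch below uses only the three numbers stated above (25.5 hours, 11,605 videos, 184 participants); the function names are hypothetical, and the averages are derived here rather than reported by the authors, so actual clip lengths and per-participant counts may vary widely.

```python
# Figures quoted in the abstract for PhyAV-Sound-11K.
TOTAL_HOURS = 25.5
NUM_VIDEOS = 11_605
NUM_PARTICIPANTS = 184


def mean_clip_seconds(total_hours: float, num_videos: int) -> float:
    """Average clip duration in seconds, assuming the total covers all videos."""
    return total_hours * 3600.0 / num_videos


def mean_videos_per_participant(num_videos: int, num_participants: int) -> float:
    """Average number of videos contributed per participant."""
    return num_videos / num_participants


if __name__ == "__main__":
    clip = mean_clip_seconds(TOTAL_HOURS, NUM_VIDEOS)
    share = mean_videos_per_participant(NUM_VIDEOS, NUM_PARTICIPANTS)
    print(f"mean clip duration: {clip:.2f} s")
    print(f"mean videos per participant: {share:.1f}")
```

Under these assumptions the average clip runs roughly 7.9 seconds, and each participant contributed about 63 videos on average.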

Code & Models

Repositories

imxtx/PhyAVBench (GitHub)
