TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

Hengyi Feng; Hao Liang; Mingrui Chen; Bohan Zeng; Meiyi Qiang; Zhengyang Zhao; Zimo Meng; Zeang Sheng; Wentao Zhang

arXiv:2605.07593 · cs.CV · May 11, 2026

TL;DR

TraceAV-Bench is a benchmark that evaluates multi-hop reasoning and multimodal hallucination robustness on long audio-visual videos, and it proves difficult for current models.

Contribution

The paper introduces the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness, organized into 4 evaluation dimensions and 15 sub-tasks.

Findings

01

Current models perform poorly on TraceAV-Bench, with the best model reaching only 68.29%.

02

Robustness to multimodal hallucination is largely independent of reasoning performance.

03

The dataset contains 2,200 questions over 578 long videos, averaging 3.68 reasoning hops.

Abstract

Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of…
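The headline figures in the abstract imply a few per-video averages that are not stated directly. The short Python sketch below derives them from the quoted numbers only; the variable and function names are illustrative, and the last quantity is a ratio of two reported averages, so it is a rough indication rather than a statistic reported by the authors.

```python
# Figures quoted in the abstract.
NUM_QUESTIONS = 2200   # multiple-choice questions
NUM_VIDEOS = 578       # long videos
TOTAL_HOURS = 339.5    # total video duration
AVG_SPAN_MIN = 15.1    # average temporal span of a reasoning chain


def derived_stats():
    """Return averages implied by the quoted figures (illustrative helper)."""
    avg_video_min = TOTAL_HOURS * 60 / NUM_VIDEOS
    questions_per_video = NUM_QUESTIONS / NUM_VIDEOS
    # Ratio of two averages: only a rough sense of how much of a
    # typical video one reasoning chain covers.
    span_share = AVG_SPAN_MIN / avg_video_min
    return avg_video_min, questions_per_video, span_share


avg_video_min, questions_per_video, span_share = derived_stats()
print(f"average video length: {avg_video_min:.1f} min")
print(f"questions per video:  {questions_per_video:.2f}")
print(f"chain span / average video length: {span_share:.0%}")
```

Under these figures the average video runs roughly 35 minutes and carries close to four questions, so a 15.1-minute chain covers a large share of a typical video.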

Code & Models

Datasets

Heinz217/TraceAV-Bench
dataset · 205 downloads
