VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

Alejandro Aparcedo; Akash Kumar; Aaryan Garg; Dalton Pham; Wen-Kai Chen; Anirudh Bharadwaj; Aman Chadha; Yogesh Rawat

arXiv:2605.01391·cs.CV·May 5, 2026

TL;DR

VISTA is a comprehensive benchmark for evaluating vision-language models on complex, multi-entity, multi-action video understanding, addressing limitations of existing simple-action benchmarks.

Contribution

VISTA introduces a large-scale, interaction-aware diagnostic benchmark with a unified taxonomy for detailed spatio-temporal analysis of VLMs.

Findings

01 Evaluated 11 state-of-the-art VLMs on VISTA, revealing specific shortcomings.

02 Decomposed videos into entities, actions, and relations for detailed diagnostics.

03 Identified spatio-temporal biases in current models.

Abstract

Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets, and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities that characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity, and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates…
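To make the decomposition described in the abstract more concrete, here is a minimal Python sketch of how a video annotation split into entities, actions, and relations might be represented, and how per-axis accuracy could be tallied. Every class, field, and function name below is hypothetical and chosen for illustration only; none of it reflects the actual VISTA data format or evaluation code, which the abstract does not specify.

```python
from dataclasses import dataclass, field


# All names here are illustrative assumptions, not the VISTA schema.
@dataclass
class Entity:
    entity_id: str
    label: str  # free-form (open-set) description, e.g. "person in red jacket"


@dataclass
class Action:
    actor_id: str      # entity performing the action
    description: str   # free-form action text
    start_s: float     # assumed temporal extent in seconds
    end_s: float


@dataclass
class Relation:
    subject_id: str
    predicate: str     # free-form relation text, e.g. "plays with"
    object_id: str


@dataclass
class VideoAnnotation:
    video_id: str
    entities: list[Entity] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def actions_of(self, entity_id: str) -> list[Action]:
        """Return every action attributed to one entity."""
        return [a for a in self.actions if a.actor_id == entity_id]


def accuracy_by_axis(results):
    """Aggregate (axis, is_correct) pairs into one accuracy per axis."""
    totals = {}
    for axis, ok in results:
        hits, count = totals.get(axis, (0, 0))
        totals[axis] = (hits + int(ok), count + 1)
    return {axis: hits / count for axis, (hits, count) in totals.items()}
```

Keeping the axis as an explicit tag on each scored question is one simple way to report relational, spatial, and temporal performance separately instead of as a single aggregate score.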
