GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Yunzhe Wang; Runhui Xu; Kexin Zheng; Tianyi Zhang; Jayavibhav Niranjan Kogundi; Soham Hans; Volkan Ustun

arXiv:2603.24329 · cs.CL · April 14, 2026

TL;DR

GameplayQA is a comprehensive benchmarking framework for evaluating multimodal perception and reasoning in multi-agent 3D gameplay videos, highlighting current model limitations and guiding future research.

Contribution

GameplayQA introduces dense annotations, diagnostic QA pairs, and a structured distractor taxonomy for multi-agent video understanding in 3D environments.

Findings

1. MLLMs lag behind human performance in key perception tasks.
2. Models struggle with temporal grounding and cross-video reasoning.
3. The framework enables detailed analysis of model hallucinations.

Abstract

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of…

Code & Models

Repositories

wangyz1999/sync-video-label
github

Datasets

wangyz1999/GameplayQA
dataset · 1.1k downloads