ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?

Shuqing Li; Jiayi Yan; Chenyu Niu; Jen-tse Huang; Yun Peng; Wenxuan Wang; Yepang Liu; Michael R. Lyu

arXiv:2510.24706·cs.CL·October 29, 2025

3 Reviews

TL;DR

This paper introduces ComboBench, a benchmark to evaluate LLMs' ability to translate semantic actions into VR device manipulations, revealing strengths in task decomposition but limitations in procedural reasoning and spatial understanding.

Contribution

The paper presents the first comprehensive benchmark for assessing LLMs' VR device manipulation capabilities across multiple games and scenarios.

Findings

1. Top models like Gemini-1.5-Pro excel in task decomposition.
2. Models struggle with procedural reasoning and spatial understanding.
3. Few-shot learning improves LLM performance significantly.

Abstract

Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 8·Confidence 3

Strengths

- Fine-grained decomposition of gameplay actions into 262 scenarios, each annotated by both human raters and LLMs, yielding a rich and reusable dataset.
- Inclusion of open-source models enables a fair and transparent comparative analysis.
- The evaluation metrics are well aligned with the proposed scenarios and are applied creatively to capture the relevant cognitive dimensions.
- Code is open and available

Weaknesses

- No prompt details, which makes it harder to assess the whole framework's performance
- Missing table reference at line 1072

Reviewer 02·Rating 4·Confidence 4

Strengths

### Quality
- Very much appreciate expert input in deciding on the cognitive capabilities
- Structure of cognitive capabilities makes sense, though it's hard to know whether it has enough coverage of the space of skills used

### Clarity
- Writing is clear and easy to follow
- Results section is particularly well-written, even if figures would help

### Originality and significance
I'm not aware of a benchmark like this, and work has clearly been put in to make a full-fledged benchmark

Weaknesses

### Quality
- What exactly is the motivation for this? It's certainly an interesting and meaningful problem, especially with VR just being a substrate to test LLMs' ability to learn and generate correct fine-grained physical control sequences. However, being explicit about the motivation would help.
- Second contribution really just feels like a part of the first contribution
- Would help to have more explanation via examples of the nature of the games, so we have intuition for what is meant

Reviewer 03·Rating 2·Confidence 4

Strengths

The idea of testing LLMs in VR is interesting. Labeling each action with semantics is a good way to analyze model behaviors.

Weaknesses

1. **Motivation.** The motivation for letting LLMs manipulate in VR is unclear. Is this benchmark mainly testing LLMs on long-horizon tasks? But there are many other more meaningful tasks, like robotics, digital agents, etc., to test such a capability. Is there anything unique in this VR setting?
2. **Evaluation Metrics.** The proposed metrics mainly compare the difference in action sequences between the model and the ground truth. However, for such long-horizon tasks, there could be multiple trajectories that lead to
