VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Zirui Wang; Junyi Zhang; Jiaxin Ge; Long Lian; Letian Fu; Lisa Dunlap; Ken Goldberg; XuDong Wang; Ion Stoica; David M. Chan; Sewon Min; Joseph E. Gonzalez

arXiv:2601.16973·cs.CV·January 26, 2026

Open Access · 1 model · 2 datasets

TL;DR

VisGym introduces a comprehensive suite of 17 environments for evaluating and training vision-language models in complex, multi-step visual tasks, revealing their current limitations and potential pathways for improvement.

Contribution

The paper presents VisGym, a benchmark suite of 17 environments with multi-step solvers that generate structured demonstrations for supervised finetuning, enabling systematic evaluation and training of multimodal agents across diverse visual interaction tasks.

Findings

01

Frontier models perform poorly in interactive multi-step tasks, with success rates of 46.6% in the easy configuration and 26.0% in the hard configuration.

02

Models struggle to leverage long context: they perform worse with an unbounded interaction history than with a truncated one.

03

Explicit goal signals and demonstrations improve model performance.

Abstract

Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated…
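The contrast between an unbounded and a truncated interaction history can be illustrated with a minimal sketch. This is not VisGym's API: the `Step` record, the `build_context` helper, and the `max_steps` parameter are hypothetical names introduced here only to show what "truncating the history" could mean for an agent loop.

```python
from typing import List, Optional, Tuple

# Hypothetical record of one interaction step: (observation, action).
# Not taken from the VisGym codebase.
Step = Tuple[str, str]


def build_context(history: List[Step], max_steps: Optional[int] = None) -> List[Step]:
    """Return the steps an agent would be shown at its next turn.

    max_steps=None keeps the full (unbounded) history.
    A positive max_steps keeps only the most recent max_steps steps.
    """
    if max_steps is None:
        return list(history)
    if max_steps <= 0:
        # Guard explicitly: history[-0:] would return the whole list.
        return []
    return history[-max_steps:]


# Five placeholder steps standing in for a real episode.
history = [(f"obs_{t}", f"act_{t}") for t in range(5)]

full = build_context(history)                  # unbounded history
recent = build_context(history, max_steps=2)   # truncated history
```

Under this sketch, `full` contains all five steps while `recent` contains only the last two, which is the kind of setting the abstract reports as working better than an unbounded history.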

Code & Models

Models

VisGym/visgym_model (Hugging Face model)

Datasets

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Generative Adversarial Networks and Image Synthesis