Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

Shaofei Cai; Zhancun Mu; Haiwen Xia; Bowei Zhang; Anji Liu; Yitao Liang

arXiv:2507.23698 · cs.RO · August 1, 2025

Open Access

TL;DR

This paper demonstrates that reinforcement learning fine-tuning in Minecraft enables visuomotor agents to achieve zero-shot generalization in spatial reasoning across diverse environments, addressing overfitting and manual task design challenges.

Contribution

It introduces a unified multi-task goal space, automated task synthesis, and an efficient distributed RL framework for large-scale training of generalizable visuomotor agents.

Findings

01. RL improves interaction success rates by 4x

02. Enables zero-shot generalization in unseen environments

03. Validates large-scale multi-task RL in 3D environments

Abstract

While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn't yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose…

Taxonomy

Topics: Advanced Vision and Imaging · Tactile and Sensory Interactions · Robotics and Sensor-Based Localization