VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Li Kang; Xiufeng Song; Heng Zhou; Yiran Qin; Jie Yang; Xiaohong Liu; Philip Torr; Lei Bai; Zhenfei Yin

arXiv:2506.09049 · cs.AI · January 23, 2026


TL;DR

This paper introduces VIKI-R, a reinforcement learning framework for embodied multi-agent cooperation, evaluated on the new VIKI-Bench benchmark, demonstrating improved coordination and visual reasoning across diverse robot types.

Contribution

The work presents VIKI-Bench, a hierarchical benchmark for embodied multi-agent cooperation, and VIKI-R, a novel RL-based method that fine-tunes vision-language models to improve multi-agent coordination.

Findings

01. VIKI-R outperforms baseline methods across all task levels.

02. Reinforcement learning fosters emergent compositional cooperation among heterogeneous agents.

03. VIKI-Bench provides a comprehensive platform for evaluating embodied multi-agent visual reasoning.

Abstract

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R,…
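The abstract describes VIKI-Bench as a three-level hierarchy (agent activation, task planning, trajectory perception) whose items combine diverse robot embodiments, multi-view observations, and structured supervision. The sketch below shows one way such items could be represented and bucketed per level for separate scoring. It is illustrative only: the class names (VikiLevel, EpisodeSample), the field names, the group_by_level helper, and the one-line readings of each level are assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass, field
from enum import Enum


class VikiLevel(Enum):
    """The three levels named in the abstract, in hierarchical order.

    The per-level comments are an assumed reading of the level names,
    not definitions taken from the benchmark.
    """
    AGENT_ACTIVATION = 1       # assumed: choosing which agents should take part
    TASK_PLANNING = 2          # assumed: planning what the chosen agents do
    TRAJECTORY_PERCEPTION = 3  # assumed: reasoning about agent motion from images


@dataclass
class EpisodeSample:
    """Hypothetical benchmark item; every field name here is illustrative."""
    level: VikiLevel
    instruction: str
    robot_types: list[str]                                # embodiments present in the scene
    view_images: list[str] = field(default_factory=list)  # paths to multi-view observations
    target: object = None                                 # level-specific supervision signal


def group_by_level(samples):
    """Bucket samples by level so each level can be evaluated separately."""
    buckets = {lvl: [] for lvl in VikiLevel}  # every level gets a bucket, even if empty
    for sample in samples:
        buckets[sample.level].append(sample)
    return buckets
```

Keeping the levels as an ordered enum makes the hierarchy explicit, and reporting results per bucket mirrors the abstract's framing of evaluation "across all task levels".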

Code & Models

Datasets

henggg/VIKI-R
dataset · 270 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Reinforcement Learning in Robotics