Collision- and Reachability-Aware Multi-Robot Control with Grounded LLM Planners
Jiabao Ji, Yongchao Chen, Yang Zhang, Ramana Rao Kompella, Chuchu Fan, Gaowen Liu, Shiyu Chang

TL;DR
This paper introduces a framework that uses reinforcement learning with verifiable rewards (RLVR) to improve small language models' ability to generate physically feasible, collision-free plans for multi-robot control, allowing them to outperform larger models.
Contribution
It presents a novel RLVR framework that grounds small LLMs with physical constraints, enhancing their reasoning for multi-robot control tasks.
Findings
Small LLMs with RLVR outperform larger models in constraint-aware planning.
The approach improves planning validity in BoxNet and BoxNet3D environments.
Grounded small LLMs enable scalable multi-robot control in complex environments.
Abstract
Large language models (LLMs) have demonstrated strong performance in various robot control tasks. However, their deployment in real-world applications remains constrained. Even state-of-the-art LLMs, such as GPT-o4mini, frequently produce invalid action plans that violate physical constraints, such as directing a robot to an unreachable location or causing collisions between robots. This issue primarily arises from a lack of awareness of these physical constraints during the reasoning process. To address this issue, we propose a novel framework that uses reinforcement learning with verifiable rewards (RLVR) to instill knowledge of physical constraints in LLMs and induce constraint-aware reasoning during plan generation. In this approach, only valid action plans that successfully complete a control task receive positive rewards. We applied our method to two small-scale LLMs: a…
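The reward rule described in the abstract (a plan is rewarded only if it is physically valid and completes the task) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Robot` class, the circular reach radius, the same-cell collision test, and the plan encoding (each step maps a robot name to a `(box, destination)` pair) are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass
from math import hypot

Point = tuple[float, float]


@dataclass(frozen=True)
class Robot:
    """Hypothetical robot model: a fixed base and a circular reach radius."""
    base: Point
    reach: float


def is_reachable(robot: Robot, target: Point) -> bool:
    """Assumed reachability rule: target lies within the reach radius."""
    dx, dy = target[0] - robot.base[0], target[1] - robot.base[1]
    return hypot(dx, dy) <= robot.reach


def plan_reward(robots: dict[str, Robot],
                boxes: dict[str, Point],
                goals: dict[str, Point],
                plan: list[dict[str, tuple[str, Point]]]) -> float:
    """Return 1.0 only for a valid plan that moves every box to its goal.

    Each step maps robot name -> (box name, destination). Any unknown name,
    unreachable pick/place location, or same-step conflict yields 0.0.
    """
    positions = dict(boxes)  # current box positions, updated step by step
    for step in plan:
        picked = [box for box, _ in step.values()]
        dests = [dest for _, dest in step.values()]
        # Assumed collision rule: two robots may not grab the same box or
        # place into the same cell within one step.
        if len(set(picked)) < len(picked) or len(set(dests)) < len(dests):
            return 0.0
        for name, (box, dest) in step.items():
            if name not in robots or box not in positions:
                return 0.0
            robot = robots[name]
            # The robot must reach both the box and the drop location.
            if not (is_reachable(robot, positions[box])
                    and is_reachable(robot, dest)):
                return 0.0
        # All checks passed for this step, so apply the moves.
        for box, dest in step.values():
            positions[box] = dest
    return 1.0 if positions == goals else 0.0
```

Because the reward is binary and computed by a deterministic checker, it can be verified without a learned reward model; a plan that hands a box from one robot to another in two reachable moves would score 1.0 under these assumptions, while a single out-of-reach move would score 0.0.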
Peer Reviews
Decision·Submitted to ICLR 2026
1. Clear problem focus and executable grounding. The reward integrates verifiable checks for reachability/feasibility and robot/object collisions; only physically valid, task‑completing plans are rewarded. This is a reproducible recipe for constraint‑aware planning behavior.
2. Strong empirical gains with small models. Grounded 3B/4B models outperform much larger baselines across both 2D and 3D setups.
3. Thoughtful analysis of reasoning. The paper probes for emergent feasibility checks in t
1. Prompt fairness on reachability (BoxNet2D). For BoxNet2D inference prompts, the textual context emphasizes collision rules but does not clearly encode numeric reachability limits; reachability is enforced by the simulator/reward. In contrast, BoxNet3D prompts do include explicit reachability bands/geometry. This asymmetry muddies the “prompt fairness” story across settings and may partially credit RL for implicitly learning a rule that was not textually available to zero‑shot baselines in 2D.
- The main idea of the paper, grounding an LLM with physical motion constraints, is reasonable and easy to understand. The paper is clearly written and well-structured, making the overall framework and experiments easy to follow. The claimed contribution is also clear.
- At the conceptual level, the approach to the problem seems like it could work effectively, but under structured and somewhat simpler setups.
- The code and especially the video examples clearly demonstrate the proposed framew
- The core idea of grounding an LLM with physical constraints is a significant and important topic. However, the proposed approach does not seem to offer a substantial conceptual contribution, as the authors retrained a slightly different formulation of a well-known RL objective function (GRPO).
- Regarding the claimed contribution, describing the physical constraints as "realistic" seems to be an overstatement, given the simplicity of the collision and reachability checking mechanisms employed
1. The writing is clear and the paper is easy to follow.
2. Two new environments are proposed to evaluate LLM-based multi-robot control.
Overall, the novelty and contributions of this work are limited for the following reasons:
1. The training protocol that uses an SFT warmup followed by GRPO is conventional. The sole modification to this standard approach is the inclusion of a simple efficiency reward term r_{efficiency} within the RL objective.
2. The proposed BoxNet2D and BoxNet3D are quite simple and straightforward. The main difference from the previous BoxNet is the use of discrete spatial coordinates for actions instead of choos
Taxonomy
Topics: Robotic Path Planning Algorithms · Robotics and Automated Systems · Robot Manipulation and Learning
