3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning
Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

TL;DR
This paper introduces a novel vision-language framework that enhances 3D scene understanding and robotic task planning by integrating 2D prompt synthesis and supervised reasoning, achieving high success rates without retraining.
Contribution
The framework combines 2D-to-3D mapping and supervised output validation to improve robotic perception and planning, eliminating the need for retraining in new environments.
Findings
Achieved a 96.0% task success rate in robotic tasks.
Ablation studies show a significant performance drop (67%) when the framework's modules are removed.
Enhanced 3D recognition and task execution robustness.
Abstract
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D…
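The abstract describes mapping 2D images to point clouds but gives no formula here. As a rough illustration only, the sketch below assumes the standard pinhole camera model and an aligned depth image; the function name `backproject_pixels` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and not taken from the paper, whose actual mapping may differ.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_pixels(
    pixels_uv: Sequence[Tuple[int, int]],
    depth_m: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift 2D pixels to 3D camera-frame points with the pinhole model.

    Assumption (not from the paper): depth_m is a depth image in metres,
    aligned with the RGB image and indexed as depth_m[row][col].
    """
    points = []
    for u, v in pixels_uv:
        z = depth_m[v][u]          # depth along the optical axis
        x = (u - cx) * z / fx      # horizontal offset from the principal point
        y = (v - cy) * z / fy      # vertical offset from the principal point
        points.append((x, y, z))
    return points


# Hypothetical 640x480 depth image with every pixel 2 m away.
depth = [[2.0] * 640 for _ in range(480)]
points = backproject_pixels(
    [(320, 240), (420, 240)], depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0
)
```

In a pipeline of this kind, a 2D location selected in the image (for example, an object centre) would be passed through such a function to obtain the 3D coordinate a robot planner can act on.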
Taxonomy
Topics: Robotic Path Planning Algorithms · Robotics and Sensor-Based Localization · AI-based Problem Solving and Planning
