Direct Contact-Tolerant Motion Planning With Vision Language Models
He Li, Jian Sun, Chengyang Li, Guoliang Li, Qiyu Ruan, Shuai Wang, Chengzhong Xu

TL;DR
This paper introduces a novel direct contact-tolerant motion planning approach that leverages vision-language models for perception, enabling robots to navigate cluttered environments with movable obstacles more robustly and efficiently.
Contribution
The paper presents a direct contact-tolerant (DCT) planner that integrates vision-language models (VLMs) for contact reasoning into a perception-to-control framework, improving upon traditional indirect spatial representations.
Findings
DCT outperforms baseline methods in cluttered environments.
Robust navigation is demonstrated both in simulation and on a real robot.
Effective contact-tolerance reasoning using VLMs.
Abstract
Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., prebuilt maps, obstacle sets), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainties. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision-language models (VLMs) into direct point perception and navigation and comprises two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches inference masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP-guided navigation (VGN), which formulates…
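The mask-to-scan projection step described for VPP can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with a hypothetical intrinsic matrix `K`, points already expressed in the camera frame, and a boolean mask that is already aligned with the current frame (the odometry-based propagation is not shown). The function name `partition_points` is illustrative, and treating unlabelled points as non-tolerable is a conservative assumption made here.

```python
import numpy as np

def partition_points(points_cam, mask, K):
    """Label camera-frame 3D points using a boolean image-space mask.

    points_cam: (N, 3) points in the camera frame (z pointing forward).
    mask: (H, W) boolean array, True where a pixel is contact-tolerable.
    K: (3, 3) pinhole intrinsic matrix (assumed, not from the paper).
    Returns a length-N boolean array: True = contact-tolerable.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    h, w = mask.shape
    tolerable = np.zeros(len(points_cam), dtype=bool)

    # Only points in front of the camera can be projected.
    in_front = points_cam[:, 2] > 1e-6
    uvw = points_cam[in_front] @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Keep projections that land inside the image bounds.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(in_front)[inside]

    # Points behind the camera or outside the image stay False
    # (treated as non-tolerable, a conservative assumption).
    tolerable[idx] = mask[v[inside], u[inside]]
    return tolerable

# Hypothetical example: the left half of a 100x100 image is tolerable.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
mask = np.zeros((100, 100), dtype=bool)
mask[:, :50] = True
points = np.array([[-0.2, 0.0, 1.0],   # projects to the left half
                   [0.2, 0.0, 1.0],    # projects to the right half
                   [0.0, 0.0, -1.0],   # behind the camera
                   [10.0, 0.0, 1.0]])  # outside the image
labels = partition_points(points, mask, K)
```

A planner could then handle the two subsets differently, for example by applying strict collision avoidance only to the points labelled non-tolerable.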
Taxonomy
Topics: Robotic Path Planning Algorithms · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
