NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation
Ran Xu, Yan Shen, Xiaoqi Li, Ruihai Wu, Hao Dong

TL;DR
NaturalVLM introduces a benchmark and a step-by-step learning framework that let robots follow complex, fine-grained natural language instructions across diverse 3D object manipulation tasks.
Contribution
The paper presents NrVLM, a benchmark of 15 manipulation tasks with over 4,500 episodes annotated with fine-grained language instructions, and a learning framework that integrates visual and linguistic cues to carry out multi-step manipulation guided by natural language.
Findings
The proposed framework outperforms the baselines on the NrVLM benchmark.
Explicit cross-modality alignment improves manipulation accuracy.
Fine-grained instructions enable complex multi-step tasks.
Abstract
Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic, task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, and without human instructions they become extremely difficult for a robot to carry out. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks and over 4,500 episodes meticulously annotated with fine-grained language instructions. We split each long-horizon task into several steps, each paired with a natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step by step according to the fine-grained instructions.…
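The step-wise structure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Step` and `Episode` classes, the `run_episode` helper, the `execute_step` callback, and the example instructions are hypothetical and do not reflect the actual NrVLM data format or the authors' framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One stage of a long-horizon task, paired with its instruction."""
    instruction: str


@dataclass
class Episode:
    """A manipulation episode: a task name plus ordered step instructions."""
    task: str
    steps: List[Step] = field(default_factory=list)


def run_episode(episode: Episode, execute_step: Callable[[str], bool]) -> int:
    """Execute the steps in order and stop at the first failure.

    `execute_step` is a hypothetical stand-in for a policy that takes one
    instruction and reports success. Returns the number of completed steps.
    """
    completed = 0
    for step in episode.steps:
        if not execute_step(step.instruction):
            break
        completed += 1
    return completed


# Hypothetical episode; the wording is illustrative, not benchmark data.
episode = Episode(
    task="open_drawer",
    steps=[
        Step("Move the gripper in front of the top drawer handle."),
        Step("Grasp the handle."),
        Step("Pull the handle straight back to slide the drawer open."),
    ],
)

# Stand-in policy that always reports success.
completed = run_episode(episode, lambda instruction: True)
print(f"{completed}/{len(episode.steps)} steps completed")
```

Counting completed steps, rather than returning a single success flag, is one way to show how far a policy gets through a multi-step task; the benchmark's own evaluation protocol may differ.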
Taxonomy
Topics: Robot Manipulation and Learning · Tactile and Sensory Interactions · Cell Image Analysis Techniques
