VLM-driven Skill Selection for Robotic Assembly Tasks
Jeong-Jung Kim, Doo-Yeol Koh, Chang-Hyun Kim

TL;DR
This paper introduces a robotic assembly framework that leverages Vision-Language Models and imitation learning to enable flexible, interpretable, and effective manipulation in assembly tasks, demonstrating high success rates.
Contribution
Integrates VLMs with imitation-learned primitive skills for robotic assembly, giving a flexible and interpretable approach to manipulation tasks.
Findings
Achieved high success rates in assembly scenarios
Demonstrated effective visual perception and natural language understanding
Maintained interpretability through primitive skill decomposition
Abstract
This paper presents a robotic assembly framework that combines Vision-Language Models (VLMs) with imitation learning for assembly manipulation tasks. Our system employs a gripper-equipped robot that moves in 3D space to perform assembly operations. The framework integrates visual perception, natural language understanding, and learned primitive skills to enable flexible and adaptive robotic manipulation. Experimental results demonstrate the effectiveness of our approach in assembly scenarios, achieving high success rates while maintaining interpretability through the structured primitive skill decomposition.
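The abstract does not specify how the VLM output is mapped to a primitive skill, so the following is only a minimal sketch of one way such a selection step could look. The skill names (`grasp`, `insert`, `release`), the `Skill` dataclass, the prompt wording, and the `vlm` callable are all hypothetical placeholders for illustration, not the authors' implementation or any real VLM API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumption: a VLM is modelled as any callable (image_bytes, prompt) -> text.
VLM = Callable[[bytes, str], str]


@dataclass
class Skill:
    name: str
    description: str
    # Stand-in for a policy learned by imitation; here it just returns a status string.
    policy: Callable[[bytes], str]


# Hypothetical primitive skill library (names are illustrative only).
SKILLS: Dict[str, Skill] = {
    "grasp": Skill("grasp", "close the gripper on the target part", lambda img: "grasped"),
    "insert": Skill("insert", "insert the held part into its slot", lambda img: "inserted"),
    "release": Skill("release", "open the gripper", lambda img: "released"),
}


def build_prompt(instruction: str, skills: Dict[str, Skill]) -> str:
    """List the available skills so the VLM can answer with a single name."""
    lines = [f"Task: {instruction}", "Choose exactly one skill by name:"]
    lines += [f"- {s.name}: {s.description}" for s in skills.values()]
    return "\n".join(lines)


def select_skill(vlm: VLM, image: bytes, instruction: str,
                 skills: Dict[str, Skill]) -> Skill:
    """Ask the VLM for a skill name and look it up in the library."""
    reply = vlm(image, build_prompt(instruction, skills)).strip().lower()
    if reply not in skills:
        # Reject anything outside the library instead of guessing.
        raise ValueError(f"VLM returned unknown skill: {reply!r}")
    return skills[reply]
```

Restricting the VLM to a closed set of named skills is what would keep the decision interpretable in a design like this: each step of the assembly can be logged as a human-readable skill name before the corresponding learned policy runs.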
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Manufacturing Process and Optimization
