VLM-driven Skill Selection for Robotic Assembly Tasks

Jeong-Jung Kim; Doo-Yeol Koh; Chang-Hyun Kim

arXiv:2511.05680·cs.RO·November 11, 2025

TL;DR

This paper introduces a robotic assembly framework that leverages Vision-Language Models and imitation learning to enable flexible, interpretable, and effective manipulation in assembly tasks, demonstrating high success rates.

Contribution

The novel integration of VLMs with imitation learning for robotic assembly provides a flexible and interpretable approach to manipulation tasks.

Findings

01. Achieved high success rates in assembly scenarios
02. Demonstrated effective visual perception and natural language understanding
03. Maintained interpretability through primitive skill decomposition

Abstract

This paper presents a robotic assembly framework that combines Vision-Language Models (VLMs) with imitation learning for assembly manipulation tasks. Our system employs a gripper-equipped robot that moves in 3D space to perform assembly operations. The framework integrates visual perception, natural language understanding, and learned primitive skills to enable flexible and adaptive robotic manipulation. Experimental results demonstrate the effectiveness of our approach in assembly scenarios, achieving high success rates while maintaining interpretability through the structured primitive skill decomposition.
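
The primitive skill decomposition described in the abstract can be pictured as a selector that maps an observation (image plus instruction) to one named skill from a fixed set. The following is a minimal sketch under stated assumptions, not the authors' implementation: the skill names (`pick`, `insert`, `place`), the `Observation` type, the `select_skill` function, and the keyword-matching `stub_vlm` that stands in for a real Vision-Language Model query are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical primitive skills. In a real system each would be a policy
# learned by imitation; here each one only returns a log string.
def pick(target: str) -> str:
    return f"pick({target})"

def insert(target: str) -> str:
    return f"insert({target})"

def place(target: str) -> str:
    return f"place({target})"

# Fixed skill vocabulary the selector is allowed to choose from.
SKILLS: Dict[str, Callable[[str], str]] = {
    "pick": pick,
    "insert": insert,
    "place": place,
}

@dataclass
class Observation:
    image: bytes       # placeholder for a camera frame (unused by the stub)
    instruction: str   # natural-language task description

def stub_vlm(obs: Observation, skill_names: List[str]) -> str:
    """Stand-in for a VLM call (assumption): returns the first entry of
    skill_names whose name appears in the instruction, or "" if none does."""
    text = obs.instruction.lower()
    for name in skill_names:
        if name in text:
            return name
    return ""

def select_skill(
    obs: Observation,
    vlm: Callable[[Observation, List[str]], str] = stub_vlm,
) -> Callable[[str], str]:
    """Ask the (stubbed) VLM for a skill name and validate it against SKILLS."""
    name = vlm(obs, list(SKILLS))
    if name not in SKILLS:
        # Reject anything outside the fixed vocabulary instead of guessing.
        raise ValueError(f"no known skill selected for: {obs.instruction!r}")
    return SKILLS[name]

if __name__ == "__main__":
    obs = Observation(image=b"", instruction="Insert the peg into the hole")
    skill = select_skill(obs)
    print(skill("peg"))
```

Restricting the selector's output to a fixed, named vocabulary is what keeps a design like this interpretable: every action the robot takes can be traced to one human-readable skill name, and an unrecognized selection fails loudly rather than triggering an arbitrary motion.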

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Manufacturing Process and Optimization