GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation
Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Suyeon Shin, Byoung-Tak Zhang

TL;DR
This paper introduces GVCCI, a lifelong learning framework that enables visual grounding models to adapt to robotic manipulation tasks without human supervision, significantly improving performance across diverse environments.
Contribution
GVCCI is the first lifelong learning approach for visual grounding in robotic manipulation, generating synthetic instructions to continuously improve model accuracy without human labels.
Findings
VG accuracy improved by up to 56.7%
LGRM performance increased by up to 29.4%
Introduced a large-scale dataset with 252k triplets
Abstract
Language-Guided Robotic Manipulation (LGRM) is a challenging task as it requires a robot to understand human instructions to manipulate everyday objects. Recent approaches in LGRM rely on pre-trained Visual Grounding (VG) models to detect objects without adapting to manipulation environments. This results in a performance drop due to a substantial domain gap between the pre-training and real-world data. A straightforward solution is to collect additional training data, but the cost of human-annotation is extortionate. In this paper, we propose Grounding Vision to Ceaselessly Created Instructions (GVCCI), a lifelong learning framework for LGRM, which continuously learns VG without human supervision. GVCCI iteratively generates synthetic instruction via object detection and trains the VG model with the generated data. We validate our framework in offline and online settings across diverse…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
