A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Houjian Yu; Mingen Li; Alireza Rezazadeh; Yang Yang; Changhyun Choi

arXiv:2409.19457 · cs.RO · February 10, 2025

TL;DR

This paper introduces a parameter-efficient CLIP-based framework for language-guided object grounding and robot grasping, improving performance while reducing computational demands for real-world robotic applications.

Contribution

It proposes a bi-directional vision-language adapter and a depth fusion branch, enabling effective multimodal understanding with fewer trainable parameters than full-model tuning.
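To give a rough sense of why adapter-style tuning needs fewer parameters than full-model tuning, the sketch below counts the weights in one standard Transformer encoder layer and in one generic bottleneck adapter (a down-projection followed by an up-projection). This is only an illustration under stated assumptions: the width of 768, the bottleneck size of 64, and the helper names are hypothetical, and the generic bottleneck adapter stands in for, but is not, the paper's bi-directional vision-language adapter.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def transformer_layer_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Parameters in one standard Transformer encoder layer."""
    # Query, key, value, and output projections of self-attention.
    attn = 4 * linear_params(d_model, d_model)
    # Two-layer feed-forward block with hidden width mlp_ratio * d_model.
    hidden = mlp_ratio * d_model
    mlp = linear_params(d_model, hidden) + linear_params(hidden, d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    return attn + mlp + norms


def bottleneck_adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameters in a generic down-project / up-project adapter."""
    return linear_params(d_model, bottleneck) + linear_params(bottleneck, d_model)


# Hypothetical sizes chosen for illustration; not taken from the paper.
D_MODEL = 768
BOTTLENECK = 64

layer = transformer_layer_params(D_MODEL)
adapter = bottleneck_adapter_params(D_MODEL, BOTTLENECK)
ratio = adapter / layer

print(f"frozen layer parameters:    {layer:,}")
print(f"trainable adapter parameters: {adapter:,}")
print(f"adapter / layer ratio:      {ratio:.2%}")
```

Under these assumed sizes, the adapter adds well under 2% of the parameters of the layer it attaches to, which is the general motivation for freezing the backbone and training only small inserted modules.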

Findings

01. Outperforms existing CLIP-based methods in object grounding accuracy

02. Successfully interprets object attributes from simple language descriptions

03. Demonstrates strong spatial reasoning in complex scenarios

Abstract

The language-guided robot grasping task requires a robot agent to integrate multimodal information from both visual and linguistic inputs to predict actions for target-driven grasping. While recent approaches utilizing Multimodal Large Language Models (MLLMs) have shown promising results, their extensive computation and data demands limit the feasibility of local deployment and customization. To address this, we propose a novel CLIP-based multimodal parameter-efficient tuning (PET) framework designed for three language-guided object grounding and grasping tasks: (1) Referring Expression Segmentation (RES), (2) Referring Grasp Synthesis (RGS), and (3) Referring Grasp Affordance (RGA). Our approach introduces two key innovations: a bi-directional vision-language adapter that aligns multimodal inputs for pixel-level language understanding and a depth fusion branch that incorporates…

Taxonomy

Topics: Robotics and Automated Systems · Multimodal Machine Learning Applications · Speech and dialogue systems