Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Riley Tavassoli; Mani Amani; Reza Akhavian

arXiv:2308.16493·cs.AI·September 1, 2023

Open Access

TL;DR

This paper presents a method to align inertial measurement unit (IMU) data with vision-language models, enabling improved robot perception and multi-modal scene understanding without retraining the entire model.

Contribution

The authors propose a novel approach to align additional modalities like IMU data with vision embeddings, enhancing VLM capabilities in robot perception without retraining.

Findings

01. Multi-modal alignment improves scene understanding accuracy.
02. Using IMU data enhances model performance in activity recognition.
03. Method enables multi-modal reasoning without retraining the entire VLM.

Abstract

Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skill set large language models (LLMs) learn during pretraining. Vision, while the most popular modality to augment LLMs with, is only one representation of a scene. In human-robot interaction scenarios, robot perception requires accurate scene understanding by the robot. In this paper, we define and demonstrate a method of aligning the embedding spaces of different modalities (in this case, inertial measurement unit (IMU) data) to the vision embedding space through a combination of supervised and contrastive training, enabling the VLM to understand and reason about these additional modalities without retraining. We opt to give the model IMU embeddings directly over using a separate human activity recognition model that…
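To make the alignment idea in the abstract concrete, here is a minimal sketch of one way a combined objective could look. It is not the authors' implementation: the abstract only says the training combines supervised and contrastive terms. The symmetric InfoNCE form, the mean-squared-error supervised term, the `temperature` and `mse_weight` parameters, and all function names are assumptions made for illustration. The sketch takes a batch of IMU embeddings and the paired embeddings from a frozen vision encoder, and returns a loss that is small when each IMU embedding sits close to its own vision embedding and far from the others.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def mse(a, b):
    """Mean squared error over all elements of two batches of vectors."""
    n = sum(len(v) for v in a)
    return sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v)) / n


def alignment_loss(imu_emb, vision_emb, temperature=0.07, mse_weight=1.0):
    """Hypothetical combined objective: symmetric contrastive term + MSE term.

    vision_emb is treated as a fixed target (the vision encoder stays frozen);
    only the IMU encoder that produced imu_emb would receive gradients.
    """
    contrastive = 0.5 * (
        info_nce(imu_emb, vision_emb, temperature)
        + info_nce(vision_emb, imu_emb, temperature)
    )
    supervised = mse(imu_emb, vision_emb)
    return contrastive + mse_weight * supervised
```

In an actual training loop this loss would be computed in an autodiff framework and minimized with respect to the IMU encoder's parameters only, which is what lets the VLM itself remain untouched.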

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: OPT