Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception
Riley Tavassoli, Mani Amani, Reza Akhavian

TL;DR
This paper presents a method to align inertial measurement unit (IMU) data with vision-language models, enabling improved robot perception and multi-modal scene understanding without retraining the entire model.
Contribution
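The authors propose an approach that aligns the embedding space of an additional modality (here, IMU data) with the vision embedding space through a combination of supervised and contrastive training, so that a frozen VLM can reason about the new modality without being retrained.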
Findings
Multi-modal alignment improves scene understanding accuracy.
Using IMU data enhances model performance in activity recognition.
Method enables multi-modal reasoning without retraining the entire VLM.
Abstract
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skill set large language models (LLMs) learn during pretraining. Vision, while the most popular modality to augment LLMs with, is only one representation of a scene. In human-robot interaction scenarios, robot perception requires accurate scene understanding by the robot. In this paper, we define and demonstrate a method of aligning the embedding spaces of different modalities (in this case, inertial measurement unit (IMU) data) to the vision embedding space through a combination of supervised and contrastive training, enabling the VLM to understand and reason about these additional modalities without retraining. We opt to give the model IMU embeddings directly over using a separate human activity recognition model that…
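To make the "combination of supervised and contrastive training" concrete, here is a minimal sketch of one way such an objective could be written. It is not the authors' implementation. It assumes that the contrastive term is a symmetric InfoNCE loss between paired IMU and (frozen) vision embeddings, that the supervised term is a cross-entropy over activity labels, and that the two are mixed with a weight `alpha`. The function names, the temperature of 0.07, and the weighting scheme are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(imu_emb, vision_emb, temperature=0.07):
    # Assumed contrastive term: row i of imu_emb is the positive pair of
    # row i of vision_emb; every other row in the batch is a negative.
    imu = l2_normalize(imu_emb)
    vis = l2_normalize(vision_emb)
    logits = imu @ vis.T / temperature
    idx = np.arange(len(logits))
    loss_imu_to_vis = -log_softmax(logits)[idx, idx].mean()
    loss_vis_to_imu = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_imu_to_vis + loss_vis_to_imu)

def cross_entropy(class_logits, labels):
    # Assumed supervised term: activity classification from the IMU branch.
    idx = np.arange(len(labels))
    return -log_softmax(class_logits)[idx, labels].mean()

def combined_loss(imu_emb, vision_emb, class_logits, labels, alpha=0.5):
    # Hypothetical mixing weight alpha between the two terms.
    contrastive = info_nce(imu_emb, vision_emb)
    supervised = cross_entropy(class_logits, labels)
    return alpha * contrastive + (1.0 - alpha) * supervised

# Toy batch: 8 paired samples with 16-dimensional embeddings and 3 activity classes.
rng = np.random.default_rng(0)
vision = rng.normal(size=(8, 16))                    # stands in for frozen vision embeddings
imu_aligned = vision + 0.01 * rng.normal(size=(8, 16))
imu_shuffled = np.roll(vision, 1, axis=0)            # pairs deliberately mismatched
labels = rng.integers(0, 3, size=8)
class_logits = rng.normal(size=(8, 3))

print(info_nce(imu_aligned, vision), info_nce(imu_shuffled, vision))
print(combined_loss(imu_aligned, vision, class_logits, labels))
```

In this sketch only the IMU encoder would receive gradients; the vision embeddings are treated as fixed targets, which is what keeps the VLM itself untouched. On the toy batch, the well-aligned IMU embeddings produce a much lower contrastive loss than the shuffled ones.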
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: OPT
