Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance
Cesar Alan Contreras, Manolis Chiou, Alireza Rastegarpanah, Michal Szulik, Rustam Stolkin

TL;DR
This paper augments a human-robot collaboration framework with vision-language and language models to improve intent recognition and assistive actions, so that robots can better understand and respond to user goals in real time.
Contribution
The paper introduces a novel augmentation of the GUIDER framework with vision-language and language models for improved object relevance filtering and intent inference.
Findings
Enhanced object relevance filtering using VLM and LLM scores
Improved robot navigation and object retrieval accuracy
Framework ready for real-time assistance evaluation
Abstract
Human-robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, a framework for inferring navigation and manipulation intents. We propose augmenting GUIDER with a vision-language model (VLM) and a text-only language model (LLM) to form a semantic prior that filters objects and locations based on the mission prompt. A vision pipeline (YOLO for object detection and the Segment Anything Model for instance segmentation) feeds candidate object crops into the VLM, which scores their relevance given an operator prompt; in addition, the list of detected object labels is ranked by a text-only LLM. These scores weight the existing navigation and manipulation layers of GUIDER, selecting context-relevant targets while suppressing unrelated objects. Once the combined belief exceeds…
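The score-weighting step described in the abstract can be illustrated with a small sketch. The abstract does not say how the VLM and LLM scores are combined or how they enter GUIDER's layers, so everything here is an assumption made for illustration: the function names `semantic_prior` and `reweight_belief`, the linear blend controlled by `alpha`, and the multiply-then-normalize update are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the blending rule and all names below are
# assumptions, not the method used in GUIDER.

def semantic_prior(vlm_scores, llm_scores, alpha=0.5):
    """Blend per-object relevance scores (assumed to lie in [0, 1]).

    vlm_scores: dict mapping object label -> VLM relevance score
    llm_scores: dict mapping object label -> LLM relevance score
    alpha:      hypothetical mixing weight given to the VLM score
    """
    return {
        label: alpha * vlm + (1.0 - alpha) * llm_scores.get(label, 0.0)
        for label, vlm in vlm_scores.items()
    }


def reweight_belief(belief, prior, eps=1e-9):
    """Multiply an existing belief over targets by the prior and renormalize."""
    weighted = {label: p * prior.get(label, 0.0) for label, p in belief.items()}
    total = sum(weighted.values())
    if total <= eps:
        # The prior suppressed every target; fall back to the original belief.
        return dict(belief)
    return {label: w / total for label, w in weighted.items()}


# Made-up numbers for a prompt such as "bring me something to drink from".
belief = {"mug": 0.5, "plant": 0.5}
vlm_scores = {"mug": 0.9, "plant": 0.1}
llm_scores = {"mug": 0.8, "plant": 0.2}

prior = semantic_prior(vlm_scores, llm_scores)
posterior = reweight_belief(belief, prior)
```

With these made-up inputs the reweighted belief shifts toward the mug and away from the plant, which matches the qualitative behavior the abstract describes: context-relevant targets are selected and unrelated objects are suppressed.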
