Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance

Cesar Alan Contreras; Manolis Chiou; Alireza Rastegarpanah; Michal Szulik; Rustam Stolkin

arXiv:2508.11093·cs.RO·August 18, 2025

TL;DR

This paper extends a human-robot collaboration framework with a vision-language model and a text-only language model to improve intent recognition and assistive actions, so that robots can better understand and respond to user goals in real time.

Contribution

The paper introduces a novel augmentation of the GUIDER framework with vision-language and language models for improved object relevance filtering and intent inference.

Findings

01 Enhanced object relevance filtering using VLM and LLM scores

02 Improved robot navigation and object retrieval accuracy

03 Framework ready for real-time assistance evaluation

Abstract

Human-robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, our framework for inferring navigation and manipulation intents. We propose augmenting GUIDER with a vision-language model (VLM) and a text-only language model (LLM) to form a semantic prior that filters objects and locations based on the mission prompt. A vision pipeline (YOLO for object detection and the Segment Anything Model for instance segmentation) feeds candidate object crops into the VLM, which scores their relevance given an operator prompt; in addition, the list of detected object labels is ranked by a text-only LLM. These scores weight the existing navigation and manipulation layers of GUIDER, selecting context-relevant targets while suppressing unrelated objects. Once the combined belief exceeds…
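
The score-weighting step described in the abstract can be illustrated with a small sketch. The abstract does not state how the VLM and LLM scores are combined or applied, so everything below is an assumption made for illustration: the function names `semantic_prior` and `reweight_belief`, the linear blend with weight `alpha`, the small `floor` that keeps low-scoring objects from being zeroed out entirely, and the example numbers are all hypothetical and are not taken from GUIDER.

```python
from typing import Dict


def semantic_prior(
    vlm_scores: Dict[str, float],
    llm_scores: Dict[str, float],
    alpha: float = 0.5,
    floor: float = 1e-3,
) -> Dict[str, float]:
    """Blend per-object relevance scores (each assumed to lie in [0, 1]).

    Hypothetical rule: a linear mix of the VLM crop score and the LLM
    label score. Labels missing from the LLM ranking are scored as 0.
    """
    prior = {}
    for label, vlm in vlm_scores.items():
        llm = llm_scores.get(label, 0.0)
        blended = alpha * vlm + (1.0 - alpha) * llm
        prior[label] = max(blended, floor)  # avoid hard zeros
    return prior


def reweight_belief(
    belief: Dict[str, float], prior: Dict[str, float]
) -> Dict[str, float]:
    """Multiply an existing intent belief by the prior and renormalize."""
    weighted = {k: p * prior.get(k, 0.0) for k, p in belief.items()}
    total = sum(weighted.values())
    if total == 0.0:
        return dict(belief)  # nothing to reweight; keep the original belief
    return {k: v / total for k, v in weighted.items()}


# Illustrative numbers only (not results from the paper).
belief = {"valve": 0.4, "mug": 0.4, "chair": 0.2}
vlm = {"valve": 0.9, "mug": 0.1, "chair": 0.1}
llm = {"valve": 0.8, "mug": 0.2, "chair": 0.0}

posterior = reweight_belief(belief, semantic_prior(vlm, llm))
top_target = max(posterior, key=posterior.get)
```

In this toy case the prompt-relevant object ("valve") ends up dominating the reweighted belief while the unrelated objects are suppressed, which mirrors the filtering behavior the abstract describes. A multiplicative reweighting is only one plausible choice; the actual framework may combine the scores differently.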
