Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos

Daniele Materia; Francesco Ragusa; Giovanni Maria Farinella

arXiv:2604.03667·cs.CV·April 7, 2026

TL;DR

This paper uses Vision Large Language Models (VLLMs) with gaze and Set-of-Mark prompts to anticipate human-object interactions in egocentric videos, and reports outperforming existing methods on the HD-EPIC dataset.

Contribution

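It improves visual grounding through Set-of-Mark prompting, infers user intent from the trajectory of the user's most recent gaze fixations, and introduces an inverse exponential sampling strategy for input video frames to capture the temporal dynamics immediately preceding the interaction.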

Findings

01 Outperforms existing methods on the HD-EPIC dataset

02 Enhances visual grounding with Set-of-Mark prompting

03 Effectively models the temporal dynamics preceding interactions

Abstract

The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system in order to guide users during daily life activities and understand their short- and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and understanding user intent via the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate…
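The abstract does not spell out the inverse exponential sampling rule, so the following is only an illustrative sketch of one plausible reading: frames are picked densely just before the interaction and progressively more sparsely further back in time. The function name `inverse_exponential_indices`, the `base` parameter, and the exact spacing formula are assumptions made for illustration, not the authors' implementation.

```python
def inverse_exponential_indices(num_frames, num_samples, base=2.0):
    """Pick frame indices whose spacing grows exponentially going back in time.

    Hypothetical helper: the last frame (closest to the interaction) is always
    kept, and earlier samples sit at exponentially growing offsets from it.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")

    last = num_frames - 1
    if num_samples == 1:
        return [last]

    # Raw offsets 0, base-1, base^2-1, ... are rescaled so that the
    # largest offset lands exactly on the first frame of the clip.
    max_raw = base ** (num_samples - 1) - 1
    indices = set()
    for k in range(num_samples):
        raw = base ** k - 1
        offset = round(last * raw / max_raw)
        indices.add(last - offset)

    # Short clips can map several offsets to the same frame, in which
    # case fewer than num_samples unique indices are returned.
    return sorted(indices)


if __name__ == "__main__":
    # 100-frame clip, 5 samples: gaps shrink toward the end of the clip.
    print(inverse_exponential_indices(100, 5))  # [0, 53, 79, 92, 99]
```

Under these assumptions the gaps between consecutive samples (53, 26, 13, 7 in the example) roughly halve as the clip approaches the interaction, which is one way to concentrate temporal context on the moments immediately preceding it.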

Code & Models

Repositories

fpv-iplab/leveraging_gaze_som_vllms_human_obj_anticipation
github