TL;DR
This paper uses Vision Large Language Models (VLLMs) with gaze and Set-of-Mark prompts to improve human-object interaction anticipation in egocentric videos, reporting state-of-the-art results on HD-EPIC.
Contribution
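It improves visual grounding through Set-of-Mark prompting, infers user intent from the trajectory of the user's most recent gaze fixations, and introduces an inverse exponential sampling strategy for input video frames to capture the temporal dynamics immediately preceding an interaction.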
Findings
Outperforms existing methods on the HD-EPIC dataset
Enhances visual grounding with Set-of-Mark prompting
Effectively models temporal dynamics preceding interactions
Abstract
The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system, both to guide users during daily life activities and to understand their short- and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and understanding user intent via the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate…
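The abstract does not give the formula behind the inverse exponential sampling strategy, so the following is only a minimal sketch of one way such a sampler could work, not the authors' implementation. It picks frame indices whose spacing shrinks toward the end of the clip, so the frames just before the interaction are sampled most densely. The function name `inverse_exponential_indices` and the `rate` parameter are hypothetical and were chosen for illustration.

```python
import math


def inverse_exponential_indices(num_frames, num_samples, rate=3.0):
    """Return sorted frame indices that become denser toward the last frame.

    Hypothetical sketch: the exact formula used in the paper is not stated
    in the abstract. `rate` (an assumed parameter) controls how strongly the
    samples concentrate near the end of the clip.
    """
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    if rate <= 0:
        raise ValueError("rate must be positive")
    if num_samples == 1:
        return [num_frames - 1]

    last = num_frames - 1
    denom = math.exp(rate) - 1.0
    indices = []
    for i in range(num_samples):
        u = i / (num_samples - 1)  # evenly spaced in [0, 1]
        # Normalized distance back from the last frame: 1 at u=0, 0 at u=1,
        # decaying exponentially so late samples sit close together.
        back = (math.exp(rate * (1.0 - u)) - 1.0) / denom
        indices.append(round(last * (1.0 - back)))

    # Rounding can produce duplicates on short clips; keep unique indices.
    return sorted(set(indices))


if __name__ == "__main__":
    # Example: choose 8 frames from a 300-frame clip.
    print(inverse_exponential_indices(300, 8))
```

With this mapping, the first sample is always the first frame and the last sample is always the final frame, while a larger `rate` pushes more of the intermediate samples toward the end of the clip.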
