Vision and Intention Boost Large Language Model in Long-Term Action Anticipation
Congqi Cao, Lanshu Hu, Yating Yu, Yanning Zhang

TL;DR
This paper introduces a novel intention-conditioned vision-language model that enhances long-term action anticipation by integrating visual semantics and reasoning capabilities of large language models, achieving state-of-the-art results.
Contribution
The study proposes a new multi-modality fusion approach that infers behavioral intentions from video and combines them with visual features for improved action prediction.
Findings
Achieves state-of-the-art performance on Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+ datasets.
Demonstrates the effectiveness of intention-guided visual representations in long-term action anticipation.
Validates the benefit of an example selection strategy considering visual and textual similarities.
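The last finding refers to choosing in-context examples by both visual and textual similarity. This summary does not say how the two similarities are combined, so the following is only a minimal sketch under stated assumptions: each candidate example has a precomputed visual embedding and a text embedding, similarity is cosine, and the two scores are mixed with a hypothetical weight `alpha`. The function names and parameters are illustrative and are not taken from the paper.

```python
import numpy as np


def cosine_sim(query, bank):
    """Cosine similarity between one query vector and every row of `bank`."""
    query = query / np.linalg.norm(query)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return bank @ query


def select_examples(q_vis, q_txt, bank_vis, bank_txt, k=2, alpha=0.5):
    """Return indices of the k candidates with the highest mixed similarity.

    alpha is an assumed mixing weight (not from the paper):
    alpha=1.0 uses visual similarity only, alpha=0.0 uses text only.
    """
    score = alpha * cosine_sim(q_vis, bank_vis) \
        + (1.0 - alpha) * cosine_sim(q_txt, bank_txt)
    # Stable sort on the negated score keeps the lower index first on ties.
    order = np.argsort(-score, kind="stable")
    return order[:k].tolist()
```

With `alpha=1.0` the ranking depends on the visual embeddings alone, which makes it easy to check how much the textual term changes the selected examples for a given query.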
Abstract
Long-term action anticipation (LTA) aims to predict future actions over an extended period. Previous approaches primarily focus on learning exclusively from video data but lack prior knowledge. Recent work leverages large language models (LLMs) through text-based inputs, which suffer severe information loss. To address the limitations of these single-modality methods, we propose a novel Intention-Conditioned Vision-Language (ICVL) model that fully leverages the rich semantic information of visual data and the powerful reasoning capabilities of LLMs. Considering intention as a high-level concept guiding the evolution of actions, we first propose to employ a vision-language model (VLM) to infer behavioral intentions as comprehensive textual features directly from video inputs. The inferred intentions are then fused with visual features through a multi-modality fusion…
Taxonomy
Topics: Advanced Graph Neural Networks
Methods: Focus
