TL;DR
This paper introduces Action-Guided Attention (AGA), a novel attention mechanism for video action anticipation that leverages predicted action sequences to improve generalization and interpretability.
Contribution
The paper proposes AGA, an attention method that explicitly uses predicted actions as queries and keys, enhancing sequence modeling and interpretability in video action anticipation.
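To make the idea of using predicted actions as queries and keys concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The projection matrices `w_q`, `w_k`, `w_v`, the choice of the most recent prediction as the query, the single-head scaled dot-product form, and all shapes are assumptions chosen for illustration. The only property taken from the summary is that attention scores are computed from predicted action sequences rather than from pixel representations.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def action_guided_attention(action_probs, features, w_q, w_k, w_v):
    """Hypothetical single-head sketch of action-guided attention.

    action_probs: (T, C) predicted action distribution for each past step
    features:     (T, D) visual features for the same steps
    w_q, w_k:     (C, H) projections applied to predicted actions (assumed)
    w_v:          (D, H) projection applied to visual features (assumed)
    """
    # Assumption: the query comes from the most recent predicted action.
    q = action_probs[-1:] @ w_q               # (1, H)
    # Keys come from the predicted actions at every past step.
    k = action_probs @ w_k                    # (T, H)
    # Values carry the visual evidence to be aggregated.
    v = features @ w_v                        # (T, H)
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (1, T)
    weights = softmax(scores, axis=-1)        # attention over past steps
    context = weights @ v                     # (1, H)
    return context[0], weights[0]

# Toy data: 6 past steps, 10 action classes, 16-dim features, 8-dim head.
rng = np.random.default_rng(0)
T, C, D, H = 6, 10, 16, 8
action_probs = softmax(rng.normal(size=(T, C)))
features = rng.normal(size=(T, D))
w_q = rng.normal(size=(C, H))
w_k = rng.normal(size=(C, H))
w_v = rng.normal(size=(D, H))

context, weights = action_guided_attention(action_probs, features, w_q, w_k, w_v)
```

In this sketch `weights` has one entry per past step, so it can be read directly as how much each past predicted action contributed to the aggregated context, which is one way the interpretability claim could be examined.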
Findings
AGA outperforms existing methods on EPIC-Kitchens-100.
The approach generalizes well to unseen test sets.
Post-training analysis reveals action dependencies and internalized evidence.
Abstract
Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming…