Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

Manuel Benavent-Lledo; Konstantinos Bacharidis; Victoria Manousaki; Konstantinos Papoutsakis; Antonis Argyros; Jose Garcia-Rodriguez

arXiv:2512.02846 · cs.CV · December 3, 2025

Open Access

TL;DR

This paper introduces AAG, a multimodal approach that uses single-frame cues and contextual information to predict actions, challenging the reliance on full video sequences in action anticipation tasks.

Contribution

AAG demonstrates that combining RGB, depth, and prior action context from single frames can effectively predict future actions, reducing dependence on extensive video data.

Findings

01. AAG performs competitively with video-based methods on multiple datasets.

02. Multimodal single-frame cues can replace traditional temporal aggregation, although how well they do so depends on the complexity of the task.

03. Contextual information from prior actions enhances action anticipation accuracy.

Abstract

Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained…
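The fusion idea in the abstract (single-frame RGB features, depth cues, and prior action context feeding one prediction) can be illustrated with a minimal sketch. This is not the authors' architecture: the function names, feature sizes, random stand-in features, and the plain concatenation followed by a linear classifier are all assumptions made for illustration only.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_predict(rgb_feat, depth_feat, context_feat, weights, bias):
    """Hypothetical late fusion: concatenate the three feature vectors
    and apply a single linear layer followed by softmax."""
    fused = rgb_feat + depth_feat + context_feat  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Stand-in features (assumption): in practice these would come from a
# visual backbone, a depth estimator, and an encoding of prior actions.
rng = random.Random(0)
rgb_feat = [rng.gauss(0, 1) for _ in range(8)]
depth_feat = [rng.gauss(0, 1) for _ in range(4)]
context_feat = [rng.gauss(0, 1) for _ in range(6)]

num_actions = 5  # hypothetical number of candidate future actions
fused_dim = len(rgb_feat) + len(depth_feat) + len(context_feat)
weights = [[rng.gauss(0, 1) for _ in range(fused_dim)] for _ in range(num_actions)]
bias = [0.0] * num_actions

probs = fuse_and_predict(rgb_feat, depth_feat, context_feat, weights, bias)
predicted_action = max(range(num_actions), key=lambda i: probs[i])
```

The point of the sketch is only that no temporal aggregation over frames appears anywhere: the prediction depends on one frame's features plus a context vector, which is the trade-off the paper investigates.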

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Action Observation and Synchronization