Understanding Multimodal Complementarity for Single-Frame Action Anticipation
Manuel Benavent-Lledo, Konstantinos Bacharidis, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez

TL;DR
This paper explores how far single-frame visual information can go in action anticipation, showing that with effective multimodal fusion a single-frame approach can rival or surpass traditional video-based methods.
Contribution
The paper introduces AAG+, a refined single-frame anticipation framework that leverages multimodal data and fusion strategies, challenging the assumption that action prediction requires dense temporal information.
Findings
AAG+ outperforms the original AAG in anticipation tasks.
Single-frame methods can match or exceed video-based approaches.
Effective multimodal fusion enhances single-frame action anticipation.
Abstract
Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action…
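As a rough illustration of what fusing these modalities can look like, the sketch below combines per-modality class scores by weighted late fusion. It is not the AAG+ architecture, which this summary does not specify. The modality names, scores, weights, and the functions `softmax` and `late_fusion` are all hypothetical, chosen only to show the mechanics of one simple fusion strategy.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(per_modality_scores, weights):
    """Weighted average of per-modality class probabilities.

    Generic late fusion, assumed here for illustration only.
    Weights are normalised so the fused vector sums to 1.
    """
    if set(per_modality_scores) != set(weights):
        raise ValueError("every modality needs exactly one weight")
    total_weight = sum(weights.values())
    n_classes = len(next(iter(per_modality_scores.values())))
    fused = [0.0] * n_classes
    for name, scores in per_modality_scores.items():
        probs = softmax(scores)
        for i, p in enumerate(probs):
            fused[i] += (weights[name] / total_weight) * p
    return fused


# Hypothetical scores over three candidate next actions, one list per modality.
scores = {
    "rgb": [2.0, 0.5, 0.1],           # appearance of the single frame
    "depth": [0.3, 1.2, 0.2],         # depth-based geometric cues
    "past_actions": [1.5, 1.4, 0.0],  # semantic encoding of past actions
}
weights = {"rgb": 0.5, "depth": 0.2, "past_actions": 0.3}

fused = late_fusion(scores, weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, [round(p, 3) for p in fused])
```

Late fusion keeps each modality's predictor independent, so a modality can be dropped or reweighted without retraining the others. Strategies that merge features earlier trade that modularity for the chance to model interactions between modalities.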
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
