Understanding Multimodal Complementarity for Single-Frame Action Anticipation

Manuel Benavent-Lledo; Konstantinos Bacharidis; Konstantinos Papoutsakis; Antonis Argyros; Jose Garcia-Rodriguez

arXiv:2601.22039 · cs.CV · January 30, 2026

Open Access

TL;DR

This paper explores how far single-frame visual information can go in action anticipation, demonstrating that, with effective multimodal fusion, a single frame can rival or surpass traditional video-based methods.

Contribution

The paper introduces AAG+, a refined single-frame anticipation framework that builds on Action Anticipation at a Glimpse (AAG) and leverages multimodal data and fusion strategies, challenging the need for dense temporal information in action prediction.

Findings

1. AAG+ outperforms the original AAG in anticipation tasks.
2. Single-frame methods can match or exceed video-based approaches.
3. Effective multimodal fusion enhances single-frame action anticipation.

Abstract

Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action…

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation