Multimodal Large Models Are Effective Action Anticipators
Binglu Wang, Yao Tian, Shunzhou Wang, Le Yang

TL;DR
This paper introduces ActionLLM, a framework that uses large language models for long-term action anticipation in videos by treating video sequences as successive tokens and integrating multimodal reasoning.
Contribution
It presents a simplified LLM-based approach with a new Cross-Modality Interaction Block for improved multimodal action anticipation.
Findings
ActionLLM outperforms existing methods on benchmark datasets.
The framework effectively models long-term temporal dynamics.
Multimodal tuning enhances semantic understanding and prediction accuracy.
Abstract
The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling…
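The pipeline the abstract outlines (observed video tokens, appended future tokens, a sequence backbone, and a linear layer in place of a textual decoder) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `causal_mixer` is a toy stand-in for the LLM backbone, the future tokens are zero-initialised placeholders, and the feature size, class count, and random weights are hypothetical. The action tuning module is not modelled.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight has one row per output class)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def causal_mixer(tokens):
    """Toy stand-in for the LLM backbone (an assumption, not the paper's model):
    each position becomes the running mean of itself and all earlier tokens."""
    out, running = [], [0.0] * len(tokens[0])
    for i, tok in enumerate(tokens, start=1):
        running = [r + t for r, t in zip(running, tok)]
        out.append([r / i for r in running])
    return out

def anticipate(observed, num_future, weight, bias):
    """Append placeholder future tokens, run the backbone, and classify
    each future position with a linear head instead of a text decoder."""
    dim = len(observed[0])
    # Hypothetical choice: zero-initialised future tokens.
    future = [[0.0] * dim for _ in range(num_future)]
    hidden = causal_mixer(observed + future)
    logits = [linear(h, weight, bias) for h in hidden[-num_future:]]
    # One predicted action class index per future position.
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Illustrative sizes and random values only.
rng = random.Random(0)
dim, num_classes = 4, 3
observed = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
weight = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

predictions = anticipate(observed, num_future=2, weight=weight, bias=bias)
print(predictions)
```

The point of the sketch is the data flow: future positions are read out in a single pass through a linear head, rather than being generated as text one token at a time.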
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Human-Automation Interaction and Safety · Anomaly Detection Techniques and Applications
Methods: Byte Pair Encoding · Linear Layer · Softmax · Dense Connections · Attention Is All You Need · Absolute Position Encodings · Dropout · Adam · Residual Connection · Multi-Head Attention
