Multimodal Large Models Are Effective Action Anticipators

Binglu Wang; Yao Tian; Shunzhou Wang; Le Yang

arXiv:2501.00795 · cs.CV · January 3, 2025

TL;DR

This paper introduces ActionLLM, a framework that uses large language models for long-term action anticipation in videos by treating video sequences as successive tokens and integrating multimodal reasoning.

Contribution

It presents a simplified LLM-based approach with a new Cross-Modality Interaction Block for improved multimodal action anticipation.

Findings

1. ActionLLM outperforms existing methods on benchmark datasets.
2. The framework effectively models long-term temporal dynamics.
3. Multimodal tuning enhances semantic understanding and prediction accuracy.

Abstract

The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling…
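The idea of appending placeholder future tokens to the observed sequence and reading action classes off those positions with a linear layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `causal_mean_mixer` is a hypothetical stand-in for the LLM backbone, the action tuning module is omitted, and all names, dimensions, and random weights are invented for the example.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector (plain-Python stand-in for a linear layer)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def causal_mean_mixer(tokens):
    """Hypothetical placeholder for the LLM backbone: each position becomes
    the mean of itself and all earlier tokens (a crude causal mixer)."""
    mixed, running = [], [0.0] * len(tokens[0])
    for count, token in enumerate(tokens, start=1):
        running = [r + t for r, t in zip(running, token)]
        mixed.append([r / count for r in running])
    return mixed

def anticipate(observed, future_tokens, weight, bias):
    """Append future tokens to the observed sequence, mix causally, and
    classify only the future positions with a linear head."""
    hidden = causal_mean_mixer(observed + future_tokens)
    future_hidden = hidden[len(observed):]
    logits = [linear(h, weight, bias) for h in future_hidden]
    # Predicted class index = argmax over the logits at each future position.
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Toy setup: all sizes and values below are assumptions for illustration.
random.seed(0)
DIM, NUM_CLASSES, NUM_OBSERVED, NUM_FUTURE = 4, 3, 5, 2
observed = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_OBSERVED)]
future_tokens = [[0.0] * DIM for _ in range(NUM_FUTURE)]  # would be learnable in practice
weight = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

predictions = anticipate(observed, future_tokens, weight, bias)
print(predictions)  # one class index per future position
```

The linear head takes the place of a textual decoder here, so each future position yields action-class logits directly rather than generated text.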

Code & Models

Repositories

2tianyao1/actionllm (PyTorch, official)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Human-Automation Interaction and Safety · Anomaly Detection Techniques and Applications

Methods: Byte Pair Encoding · Linear Layer · Softmax · Dense Connections · Attention Is All You Need · Absolute Position Encodings · Dropout · Adam · Residual Connection · Multi-Head Attention