Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025

Qiaohui Chu; Haoyu Zhang; Yisen Feng; Meng Liu; Weili Guan; Yaowei Wang; Liqiang Nie

arXiv:2506.02550 · cs.CV · June 12, 2025

TL;DR

This report introduces a three-stage framework that uses foundation models for long-term action anticipation in egocentric videos. It took first place in the Ego4D LTA Challenge at CVPR 2025.

Contribution

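The framework chains three components: a visual encoder for feature extraction, a Transformer that recognizes verbs and nouns with the help of a verb-noun co-occurrence matrix, and a fine-tuned LLM that anticipates future action sequences from the recognized verb-noun pairs.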

Findings

01. Achieved first place in the Ego4D LTA Challenge 2025

02. Established a new state-of-the-art in long-term action anticipation

03. Demonstrated effectiveness of combining visual features with large language models

Abstract

In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at…
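
To make the second and third stages more concrete, here is a minimal Python sketch of one way a verb-noun co-occurrence matrix could re-score recognition outputs, and of how the resulting pairs could be turned into a text prompt. The abstract does not say how the matrix is incorporated or what the prompt looks like, so everything here is an assumption made for illustration: the function names (build_cooccurrence, rescore, format_prompt), the add-k smoothing, the multiplicative re-scoring rule, the prompt wording, the toy vocabulary, and the horizon value are all hypothetical.

```python
from collections import Counter
from itertools import product


def build_cooccurrence(pairs, verbs, nouns, smoothing=1.0):
    """Estimate P(noun | verb) from (verb, noun) training pairs.

    Add-k smoothing keeps unseen pairs from getting zero weight.
    This estimator is an assumption, not the report's method.
    """
    counts = Counter(pairs)
    table = {}
    for v in verbs:
        total = sum(counts[(v, n)] for n in nouns) + smoothing * len(nouns)
        for n in nouns:
            table[(v, n)] = (counts[(v, n)] + smoothing) / total
    return table


def rescore(verb_probs, noun_probs, cooc):
    """Return the (verb, noun) pair maximizing p(verb) * p(noun) * cooc weight."""
    return max(
        product(verb_probs, noun_probs),
        key=lambda vn: verb_probs[vn[0]] * noun_probs[vn[1]] * cooc[vn],
    )


def format_prompt(observed, horizon):
    """Render recognized verb-noun pairs as a text prompt (hypothetical template)."""
    history = ", ".join(f"{v} {n}" for v, n in observed)
    return (
        f"Observed actions: {history}. "
        f"Predict the next {horizon} actions as verb noun pairs."
    )


# Toy vocabulary and training annotations (made up for this example).
verbs = ["cut", "wash"]
nouns = ["onion", "plate"]
train_pairs = (
    [("cut", "onion")] * 3 + [("wash", "plate")] * 3 + [("wash", "onion")]
)
cooc = build_cooccurrence(train_pairs, verbs, nouns)

# Hypothetical per-head classifier outputs for one video segment.
verb_probs = {"cut": 0.55, "wash": 0.45}
noun_probs = {"onion": 0.45, "plate": 0.55}

# Picking each head's argmax independently yields an implausible pair.
independent = (
    max(verb_probs, key=verb_probs.get),
    max(noun_probs, key=noun_probs.get),
)
print(independent)  # ('cut', 'plate')

# Re-scoring with the co-occurrence prior prefers a pair seen in training.
best = rescore(verb_probs, noun_probs, cooc)
print(best)  # ('cut', 'onion')

prompt = format_prompt([("take", "knife"), best], horizon=20)
print(prompt)
```

In this toy case the two classifier heads disagree slightly, and the co-occurrence prior tips the joint decision toward a pair that actually occurs in the training annotations. In the framework described above, a prompt of this kind would then be passed to the fine-tuned LLM, which is not modeled here.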

Taxonomy

Topics: Robotics and Automated Systems · Virtual Reality Applications and Impacts