Future Transformer for Long-term Action Anticipation

Dayoung Gong; Joonseok Lee; Manjin Kim; Seong Jong Ha; Minsu Cho

arXiv:2205.14022·cs.CV·May 30, 2022

TL;DR

The paper introduces Future Transformer (FUTR), an attention-based model that predicts long-term future actions in videos by attending globally over all input frames and output tokens. It decodes the whole future sequence in parallel and reports state-of-the-art results on the Breakfast and 50 Salads benchmarks.

Contribution

The paper presents an end-to-end transformer model for long-term action anticipation that predicts the entire future action sequence in parallel, improving both accuracy and inference speed over previous autoregressive models.

Findings

01. Achieves state-of-the-art results on the Breakfast and 50 Salads datasets.

02. Enables fast, parallel decoding of long-term future actions.

03. Outperforms previous autoregressive models in accuracy.

Abstract

The task of predicting future actions from a video is crucial for a real-world agent interacting with others. When anticipating actions in the distant future, we humans typically consider long-term relations over the whole sequence of actions, i.e., not only observed actions in the past but also potential actions in the future. In a similar spirit, we propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR), that leverages global attention over all input frames and output tokens to predict a minutes-long sequence of future actions. Unlike the previous autoregressive models, the proposed method learns to predict the whole sequence of future actions in parallel decoding, enabling more accurate and fast inference for long-term anticipation. We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50 Salads,…
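
The contrast the abstract draws between autoregressive and parallel decoding can be illustrated with a toy sketch. This is not the FUTR implementation (the authors' code is in the repository listed under Code & Models); it is a minimal pure-Python illustration under stated assumptions. All names (`attend`, `decode_parallel`, `decode_autoregressive`), the dimensions, the random features, and the single-head attention without learned projections are hypothetical simplifications. Each future slot gets its own query vector and attends over the observed frames plus all output queries, so no slot waits on another slot's prediction.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def classify(vec, class_weights):
    """Linear scoring and argmax over hypothetical action classes."""
    scores = [dot(vec, w) for w in class_weights]
    return scores.index(max(scores))

def decode_parallel(frame_feats, queries, class_weights):
    """Parallel decoding: every slot attends over frames and all output queries.

    The list comprehension has no dependency between slots, so all slots
    could be computed at once (assumed simplification of global attention).
    """
    context = frame_feats + queries
    return [classify(attend(q, context, context), class_weights) for q in queries]

def decode_autoregressive(frame_feats, start, class_embed, class_weights, steps):
    """Baseline for contrast: each step's query is the previous prediction."""
    outputs, query, context = [], start, list(frame_feats)
    for _ in range(steps):
        label = classify(attend(query, context, context), class_weights)
        outputs.append(label)
        query = class_embed[label]   # next step must wait for this label
        context.append(query)
    return outputs

def rand_vec(dim):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

# Hypothetical toy sizes and random stand-ins for learned parameters.
random.seed(0)
DIM, NUM_FRAMES, NUM_QUERIES, NUM_CLASSES = 4, 6, 3, 5
frame_feats = [rand_vec(DIM) for _ in range(NUM_FRAMES)]
queries = [rand_vec(DIM) for _ in range(NUM_QUERIES)]
class_weights = [rand_vec(DIM) for _ in range(NUM_CLASSES)]

parallel_out = decode_parallel(frame_feats, queries, class_weights)
sequential_out = decode_autoregressive(
    frame_feats, queries[0], class_weights, class_weights, NUM_QUERIES
)
print("parallel:", parallel_out)
print("autoregressive:", sequential_out)
```

In the autoregressive baseline the loop body cannot start step t+1 until step t has produced a label, which is the sequential bottleneck that parallel decoding avoids. With random weights the predicted labels carry no meaning; the sketch only shows the data dependencies.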

Code & Models

Repositories

gongda0e/futr (PyTorch)

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Multi-Head Attention