Future Transformer for Long-term Action Anticipation
Dayoung Gong, Joonseok Lee, Manjin Kim, Seong Jong Ha, Minsu Cho

TL;DR
The paper introduces Future Transformer (FUTR), an attention-based model that predicts long-term future actions in videos by applying global attention over all observed frames and output tokens. It decodes the whole future sequence in parallel and achieves state-of-the-art results on the Breakfast and 50 Salads benchmarks.
Contribution
An end-to-end transformer model for long-term action anticipation that predicts the entire future action sequence in parallel, improving both accuracy and inference speed over previous autoregressive models.
Findings
Achieves state-of-the-art results on Breakfast and 50 Salads datasets.
Enables fast, parallel decoding of long-term future actions.
Outperforms previous autoregressive models in accuracy.
Abstract
The task of predicting future actions from a video is crucial for a real-world agent interacting with others. When anticipating actions in the distant future, we humans typically consider long-term relations over the whole sequence of actions, i.e., not only observed actions in the past but also potential actions in the future. In a similar spirit, we propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR), that leverages global attention over all input frames and output tokens to predict a minutes-long sequence of future actions. Unlike the previous autoregressive models, the proposed method learns to predict the whole sequence of future actions in parallel decoding, enabling more accurate and fast inference for long-term anticipation. We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50 Salads,…
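To make the contrast with autoregressive decoding concrete, here is a minimal sketch of parallel decoding, not the authors' implementation. It assumes a fixed set of query vectors (one per future action slot) that each attend over all encoded frame features and are classified independently. All names, dimensions, and the random weights are hypothetical, and the sketch omits multi-head attention, stacked layers, and the attention among output tokens that the abstract mentions.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Scaled dot-product attention of one query over all frame features."""
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    )
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def parallel_decode(frame_feats, queries, W_cls):
    """Predict one action label per query.

    No query consumes another query's prediction, so the loop body could
    run as a single batched operation (unlike autoregressive decoding,
    where step t needs the output of step t-1).
    """
    labels = []
    for q in queries:
        context = cross_attend(q, frame_feats, frame_feats)
        logits = matvec(W_cls, context)
        labels.append(max(range(len(logits)), key=logits.__getitem__))
    return labels

# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
DIM, N_FRAMES, N_QUERIES, N_CLASSES = 8, 20, 5, 4
frame_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_FRAMES)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_QUERIES)]
W_cls = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_CLASSES)]

predicted = parallel_decode(frame_feats, queries, W_cls)
print(predicted)  # one class index per future action slot
```

Because each slot's prediction depends only on its query and the observed frames, reordering the queries simply reorders the outputs. An autoregressive decoder does not have this property, which is why it must run its steps sequentially.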
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Multi-Head Attention
