Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation
Zeyun Zhong, David Schneider, Michael Voit, Rainer Stiefelhagen, Jürgen Beyerer

TL;DR
This paper introduces the Anticipative Feature Fusion Transformer (AFFT), a multi-modal fusion method for action anticipation that fuses features early rather than averaging unimodal scores, and that outperforms score fusion on EpicKitchens-100 and EGTEA Gaze+.
Contribution
The paper presents a transformer-based early fusion approach for multi-modal data in action anticipation, enabling easy addition of new modalities without architectural changes.
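To make the extensibility claim concrete, here is a minimal sketch of attention-based feature fusion over one token per modality. It is not the authors' architecture: the function names, the single attention head, the mean pooling, the identity projection matrices, and the toy feature values are all assumptions chosen for illustration, and each modality is assumed to be already projected to a common feature dimension. The point it illustrates is that the projection weights depend only on the feature dimension, so a fourth modality adds a token rather than a new parameter shape.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fuse(tokens, Wq, Wk, Wv):
    """Single-head self-attention across modality tokens, then mean pooling.

    tokens: one feature vector per modality, all of the same length (assumed).
    Returns one fused vector whose length does not depend on len(tokens).
    """
    qs = [matvec(Wq, t) for t in tokens]
    ks = [matvec(Wk, t) for t in tokens]
    vs = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(ks[0]))
    attended = []
    for q in qs:
        # Each modality attends to every modality, including itself.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in ks]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, vs))
                         for j in range(len(vs[0]))])
    # Average the attended tokens into a single fused representation.
    return [sum(tok[j] for tok in attended) / len(attended)
            for j in range(len(attended[0]))]

# Hypothetical toy features; identity matrices stand in for learned weights.
DIM = 4
IDENTITY = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
rgb = [0.2, 0.1, 0.0, 0.5]
flow = [0.0, 0.4, 0.3, 0.1]
obj = [0.6, 0.0, 0.2, 0.2]
audio = [0.1, 0.3, 0.5, 0.0]

fused_three = fuse([rgb, flow, obj], IDENTITY, IDENTITY, IDENTITY)
fused_four = fuse([rgb, flow, obj, audio], IDENTITY, IDENTITY, IDENTITY)
print(len(fused_three), len(fused_four))
```

The same three weight matrices serve both calls; only the token list grows when audio is added.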
Findings
AFFT outperforms score fusion methods on the EpicKitchens-100 and EGTEA Gaze+ datasets.
Including audio features improves anticipation performance.
AFFT demonstrates extensibility to new modalities like audio.
Abstract
Although human action anticipation is an inherently multi-modal task, state-of-the-art methods on well-known action anticipation datasets leverage this data by applying ensemble methods and averaging scores of unimodal anticipation networks. In this work, we introduce transformer-based modality fusion techniques, which unify multi-modal data at an early stage. Our Anticipative Feature Fusion Transformer (AFFT) proves to be superior to popular score fusion approaches and presents state-of-the-art results, outperforming previous methods on EpicKitchens-100 and EGTEA Gaze+. Our model is easily extensible and allows for adding new modalities without architectural changes. Consequently, we extracted audio features on EpicKitchens-100, which we add to the set of commonly used features in the community.
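For comparison, the score fusion baseline that the abstract describes (averaging the scores of unimodal anticipation networks) can be sketched as follows. This is an illustrative assumption about the general technique, not code from the paper: the function names, the optional per-modality weights, and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_fusion(logits_per_modality, weights=None):
    """Late fusion: weighted average of per-modality class probabilities.

    logits_per_modality: one list of class logits per unimodal network.
    weights: optional per-modality weights; defaults to a plain average.
    """
    probs = [softmax(logits) for logits in logits_per_modality]
    n = len(probs)
    if weights is None:
        weights = [1.0 / n] * n
    num_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs))
            for c in range(num_classes)]

# Hypothetical logits from three independently trained unimodal networks.
rgb_logits = [2.0, 0.5, 0.1]
flow_logits = [0.2, 1.5, 0.3]
audio_logits = [1.8, 0.4, 0.2]

fused = score_fusion([rgb_logits, flow_logits, audio_logits])
predicted = max(range(len(fused)), key=lambda c: fused[c])
print(predicted, fused)
```

In this scheme the modalities interact only after each network has produced its final scores, which is the limitation that fusing features at an earlier stage is meant to address.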
Videos
Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation · YouTube
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Absolute Position Encodings · Layer Normalization
