Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation

Zeyun Zhong; David Schneider; Michael Voit; Rainer Stiefelhagen; Jürgen Beyerer

arXiv:2210.12649 · cs.CV · October 25, 2022 · 1 citation

Open Access · 1 Repo · 1 Video

TL;DR

This paper introduces the Anticipative Feature Fusion Transformer (AFFT), a multi-modal fusion method for action anticipation that outperforms score fusion approaches on EpicKitchens-100 and EGTEA Gaze+.

Contribution

The paper presents a transformer-based early fusion approach for multi-modal data in action anticipation, enabling easy addition of new modalities without architectural changes.
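To illustrate why attention-based early fusion can accept an extra modality without architectural changes, here is a minimal sketch. It is not the AFFT architecture: it uses a single parameter-free scaled dot-product self-attention step over one feature token per modality, and the function name `fuse_modalities`, the modality names, and the toy feature values are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(tokens):
    """Fuse M per-modality feature vectors (each of dimension d) into one vector.

    Hypothetical helper: one self-attention step without learned projections,
    followed by mean pooling over the attended tokens.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    attended = []
    for q in tokens:
        # Scaled dot-product score of this modality against every modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        # Attention output: weighted sum of all modality tokens.
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        )
    # Mean-pool the attended tokens into a single fused feature.
    return [sum(t[i] for t in attended) / len(attended) for i in range(d)]


# Toy features; names and values are illustrative assumptions.
rgb = [0.2, 0.1, 0.7, 0.0]
flow = [0.0, 0.5, 0.1, 0.4]
obj = [0.3, 0.3, 0.2, 0.2]
audio = [0.1, 0.0, 0.0, 0.9]

fused_three = fuse_modalities([rgb, flow, obj])
fused_four = fuse_modalities([rgb, flow, obj, audio])  # same code, one more token
```

Because attention operates on a set of tokens of any size, adding `audio` only lengthens the input list; the fused output keeps the same dimension, so any downstream anticipation head would be unaffected.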

Findings

1. AFFT outperforms score fusion methods on EpicKitchens-100 and EGTEA Gaze+ datasets.

2. Inclusion of audio features improves anticipation performance.

3. AFFT demonstrates extensibility to new modalities like audio.

Abstract

Although human action anticipation is an inherently multi-modal task, state-of-the-art methods on well-known action anticipation datasets leverage this data by applying ensemble methods and averaging the scores of unimodal anticipation networks. In this work, we introduce transformer-based modality fusion techniques, which unify multi-modal data at an early stage. Our Anticipative Feature Fusion Transformer (AFFT) proves to be superior to popular score fusion approaches and presents state-of-the-art results, outperforming previous methods on EpicKitchens-100 and EGTEA Gaze+. Our model is easily extensible and allows for adding new modalities without architectural changes. Consequently, we extracted audio features on EpicKitchens-100, which we add to the set of commonly used features in the community.
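For contrast, the score fusion baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration that assumes each unimodal network outputs class logits; the function name `score_fusion`, the optional `weights` parameter, and the toy logits are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_fusion(per_modality_logits, weights=None):
    """Late fusion: (weighted) average of per-modality class probabilities."""
    probs = [softmax(logits) for logits in per_modality_logits]
    n = len(probs)
    if weights is None:
        weights = [1.0 / n] * n  # plain averaging
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) for c in range(num_classes)
    ]


# Toy logits for three action classes from three separately trained models.
rgb_logits = [2.0, 0.5, 0.1]
flow_logits = [0.3, 1.8, 0.2]
audio_logits = [0.1, 1.5, 0.4]

fused = score_fusion([rgb_logits, flow_logits, audio_logits])
predicted = max(range(len(fused)), key=fused.__getitem__)  # index of the top class
```

Here the modalities interact only after each network has already produced its prediction, which is the limitation that fusing features at an early stage is meant to address.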

Code & Models

Repositories

zeyun-zhong/afft (PyTorch, official)

Videos

Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation · YouTube

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Absolute Position Encodings · Layer Normalization