M&M Mix: A Multimodal Multiview Transformer Ensemble

Xuehan Xiong; Anurag Arnab; Arsha Nagrani; Cordelia Schmid

arXiv:2206.09852·cs.CV·June 22, 2022

TL;DR

This paper presents M&M Mix, an ensemble of multimodal multiview transformers for action recognition, achieving state-of-the-art results by adapting and combining multiple MTV models with different modalities and backbone sizes.

Contribution

It adapts the Multiview Transformer for Video Recognition (MTV) to multimodal inputs and ensembles the resulting M&M models, improving action recognition accuracy on Epic-Kitchens over the previous year's winning entry.

Findings

01. Achieved 52.8% Top-1 accuracy on action classes on the Epic-Kitchens test set.

02. Outperformed last year's winning entry by 4.1%.

03. Demonstrated the effectiveness of a multimodal ensemble approach.

Abstract

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models of varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy for action classes on the test set, which is 4.1% higher than last year's winning entry.
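
As a rough illustration of what ensembling several classifiers can look like, the sketch below averages per-model class probabilities for a single clip and scores Top-1 accuracy. The abstract does not say how the M&M models are combined, so the averaging rule, the function names (`softmax`, `ensemble_average`, `argmax`, `top1_accuracy`), and the toy numbers are all assumptions made for illustration, not the authors' method.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_average(per_model_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities over models for one clip.

    ASSUMPTION: uniform averaging of softmax outputs. The report summary
    does not state the actual combination rule.
    """
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_models for c in range(n_classes)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of clips whose highest-scoring class matches the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Toy example: two hypothetical models disagree on a two-class clip.
clip_scores = ensemble_average([[2.0, 0.0], [0.0, 1.0]])
predicted_class = argmax(clip_scores)
```

Averaging probabilities rather than raw logits keeps models with different output scales from dominating the result; weighted averaging would be an equally plausible choice under these assumptions.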

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications

Methods: Attention Is All You Need · Test · Linear Layer · Softmax · Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Byte Pair Encoding