M&M Mix: A Multimodal Multiview Transformer Ensemble
Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

TL;DR
This report presents M&M Mix, an ensemble of Multimodal Multiview Transformer (M&M) models for action recognition. It adapts the Multiview Transformer for Video Recognition (MTV) to multimodal inputs and combines models with different input modalities and backbone sizes, winning the 2022 Epic-Kitchens Action Recognition Challenge.
Contribution
Introduces Multimodal MTV (M&M), an adaptation of MTV to multimodal inputs, and an ensemble of M&M models that improves action recognition accuracy over the previous year's winning challenge entry.
Findings
Achieved 52.8% Top-1 accuracy on action classes on the Epic-Kitchens test set.
Outperformed the previous year's winning entry by 4.1%.
Showed that ensembling M&M models with different backbone sizes and input modalities is effective for action recognition.
Abstract
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on the test set in action classes, which is 4.1% higher than last year's winning entry.
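The abstract does not say how the ensemble members are combined. As a rough illustration only, the sketch below assumes the simplest common rule: convert each model's logits to class probabilities with a softmax, then take a (optionally weighted) average. The function names, the weights, and the example logits are all hypothetical and are not taken from the report.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Average per-model class probabilities.

    Assumption: plain (weighted) probability averaging; the report may
    combine its models differently.
    Returns (top1_class_index, averaged_probabilities).
    """
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0] * n_models
    weight_sum = sum(weights)
    weights = [w / weight_sum for w in weights]  # normalize to sum to 1

    num_classes = len(per_model_logits[0])
    averaged = [0.0] * num_classes
    for w, logits in zip(weights, per_model_logits):
        for i, p in enumerate(softmax(logits)):
            averaged[i] += w * p

    top1 = max(range(num_classes), key=averaged.__getitem__)
    return top1, averaged


# Hypothetical logits from three ensemble members over three action classes.
example_logits = [
    [2.0, 1.0, 0.1],  # e.g. a smaller-backbone model
    [0.5, 2.5, 0.3],  # e.g. a larger-backbone model
    [0.2, 2.2, 0.1],  # e.g. a model with a different input modality
]
top1, probs = ensemble_predict(example_logits)
print(top1, [round(p, 3) for p in probs])
```

In this toy input the first model prefers class 0 while the other two prefer class 1, so the averaged prediction follows the majority. Passing `weights` would let stronger members count for more, which is again an assumption rather than something the report states.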
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications
Methods: Attention Is All You Need · Test · Linear Layer · Softmax · Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Byte Pair Encoding
