Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova, Alejandro Pardo, Humam Alwassel, Bernard Ghanem

TL;DR
This paper introduces a Missing Modality Token (MMT) approach to improve multimodal egocentric video understanding, maintaining high performance despite incomplete sensory data in various datasets.
Contribution
The study proposes the MMT method, a novel strategy that enhances transformer-based models to handle missing modalities effectively in egocentric video analysis.
Findings
Reduces the performance drop from ~30% to ~10% when half of the test set is missing a modality
Demonstrates MMT's effectiveness across multiple datasets
Shows MMT's adaptability to different training scenarios
Abstract
Multimodal video understanding is crucial for analyzing egocentric videos, where integrating multiple sensory signals significantly enhances action recognition and moment localization. However, practical applications often grapple with incomplete modalities due to factors like privacy concerns, efficiency demands, or hardware malfunctions. Addressing this, our study delves into the impact of missing modalities on egocentric action recognition, particularly within transformer-based models. We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent, a strategy that proves effective in the Ego4D, Epic-Kitchens, and Epic-Sounds datasets. Our method mitigates the performance loss, reducing it from its original ~30% drop to only ~10% when half of the test set is modal-incomplete. Through extensive experimentation, we…
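The idea of a missing-modality token can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption for illustration: the function names, the two-modality (video plus audio) setup, the fixed placeholder vector standing in for what would be a learned parameter, and the random modality dropping at training time. It is not the authors' implementation.

```python
import random

def build_input_tokens(video_tokens, audio_tokens, missing_token, num_audio_tokens):
    """Concatenate per-modality tokens for a transformer input.

    If audio is absent (None), a placeholder token is repeated in its place so
    the sequence has the same length as a modal-complete sample.
    """
    if audio_tokens is None:
        audio_tokens = [list(missing_token) for _ in range(num_audio_tokens)]
    return video_tokens + audio_tokens

def randomly_drop_audio(audio_tokens, drop_rate, rng):
    """Hypothetical training-time step: hide audio with probability drop_rate."""
    return None if rng.random() < drop_rate else audio_tokens

# Toy 2-dimensional tokens; a real model would use learned embeddings.
video = [[0.1, 0.2], [0.3, 0.4]]
audio = [[0.5, 0.6], [0.7, 0.8]]
mmt = [0.0, 1.0]  # stand-in for a learned missing-modality token (assumption)

complete = build_input_tokens(video, audio, mmt, num_audio_tokens=2)
incomplete = build_input_tokens(video, None, mmt, num_audio_tokens=2)
```

Both calls return a sequence of four tokens, so downstream layers see the same input shape whether or not audio is present. In the modal-incomplete case the last two positions hold the placeholder instead of audio features.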
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Sparse Evolutionary Training
