Multimodal Distillation for Egocentric Action Recognition
Gorjan Radevski, Dusan Grujicic, Marie-Francine Moens, Matthew Blaschko, Tinne Tuytelaars

TL;DR
This paper introduces a multimodal knowledge distillation method that enables egocentric action recognition models to achieve high accuracy using only RGB inputs at inference, by learning from multimodal teachers during training.
Contribution
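The paper presents a multimodal knowledge distillation framework in which a unimodal RGB student learns from multimodal teachers during training. The modality-specific modules are then no longer needed at inference, which reduces complexity while aiming to retain the performance of the multimodal approach.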
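The idea of an RGB student learning from multimodal teachers can be illustrated with a generic knowledge distillation loss. This is a minimal sketch under stated assumptions, not the authors' implementation: averaging the teachers' softened predictions, the temperature-scaled KL term, and the names `distillation_loss`, `temperature`, and `alpha` are all illustrative choices, and the logits are made-up numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher_probs(teacher_logits, temperature):
    """Average the softened predictions of several modality-specific teachers.

    Assumption: a plain unweighted average; other weightings are possible.
    """
    per_teacher = [softmax(logits, temperature) for logits in teacher_logits]
    n_teachers = len(per_teacher)
    n_classes = len(per_teacher[0])
    return [sum(p[c] for p in per_teacher) / n_teachers for c in range(n_classes)]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of cross-entropy on the label and KL(teacher || student)."""
    p_teacher = ensemble_teacher_probs(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    ce = -math.log(softmax(student_logits)[label])
    # The temperature**2 factor is the usual rescaling of the soft-target term.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


# Hypothetical logits for 3 action classes from three teachers
# (e.g. RGB, optical flow, object detections) and one RGB student.
teachers = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [2.2, 0.3, 0.4]]
student = [1.0, 0.8, 0.3]
loss = distillation_loss(student, teachers, label=0)
```

Only the student is evaluated at inference in this setup; the teachers appear solely inside the training loss, which is where the reduction in deployment complexity comes from.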
Findings
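RGB students taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground-truth labels.
The approach reduces computational complexity at inference, since only RGB frames are needed as input.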
Performance is maintained with fewer input views.
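To make the calibration finding concrete, the sketch below computes the expected calibration error (ECE), a commonly used calibration measure. Whether the paper reports this exact metric is an assumption; the function name, the equal-width binning, and the toy inputs are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between confidence and accuracy per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy example: a model that is 90% confident but right only half the time.
overconfident = expected_calibration_error([0.9, 0.9], [True, False])
```

A lower value means the model's confidence tracks its actual accuracy more closely, which is the sense in which a student can be "better calibrated" than a baseline.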
Abstract
The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well. However, their performance improves further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, audio, etc. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students which are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth…
Videos
Multimodal Distillation for Egocentric Action Recognition · YouTube
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Multimodal Machine Learning Applications
Methods: Knowledge Distillation
