CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets
Tanay Agrawal, Mohammed Guermal, Michal Balazia, Francois Bremond

TL;DR
CM3T is an efficient, model-agnostic framework that adapts transformer models for multimodal video classification. It requires minimal retraining and achieves results comparable to or better than the state of the art on three datasets.
Contribution
Introduces CM3T, a model-agnostic plugin architecture with novel adapters for efficient cross-modal learning without extensive retraining.
Findings
Achieves results comparable to or better than the state of the art with fewer trainable parameters.
Demonstrates effectiveness on three datasets (Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5) that cover different recording settings and tasks.
Requires only 12.8% and 22.3% trainable parameters for video and multimodal processing, respectively.
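To make the parameter percentages concrete, here is a minimal sketch of how such a figure can be computed. It assumes the percentage is the number of trainable plugin parameters divided by the size of the frozen backbone; the summary does not spell out the exact definition, and the counts below are hypothetical values chosen only to reproduce the reported percentages, not numbers from the paper.

```python
def trainable_percentage(trainable_params: int, backbone_params: int) -> float:
    """Return trainable parameters as a percentage of the frozen backbone size."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return 100.0 * trainable_params / backbone_params


# Hypothetical parameter counts (assumptions for illustration only).
HYPOTHETICAL_BACKBONE = 100_000_000
HYPOTHETICAL_VIDEO_PLUGINS = 12_800_000
HYPOTHETICAL_MULTIMODAL_PLUGINS = 22_300_000

video_pct = trainable_percentage(HYPOTHETICAL_VIDEO_PLUGINS, HYPOTHETICAL_BACKBONE)
multimodal_pct = trainable_percentage(HYPOTHETICAL_MULTIMODAL_PLUGINS, HYPOTHETICAL_BACKBONE)

print(f"video: {video_pct:.1f}%  multimodal: {multimodal_pct:.1f}%")
```

Because the backbone stays frozen, only the plugin parameters receive gradient updates, which is where the training-cost savings come from.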
Abstract
Challenges in cross-learning include inhomogeneous or even inadequate amounts of training data and a lack of resources for retraining large pretrained models. Inspired by transfer learning techniques from NLP, namely adapters and prefix tuning, this paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T, that adapts transformer-based models to new or missing information. We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient because the backbone and other plugins do not need to be finetuned along with these additions. Comparative and ablation studies on three datasets (Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5) show the efficacy of this framework on different recording settings and tasks. With only 12.8% trainable parameters compared to the…
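The idea of a cross-attention adapter attached to a frozen backbone can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the architecture from the paper: the class name `CrossAttentionAdapter`, the single attention head, the bottleneck projection, and the zero-initialised up-projection are all hypothetical choices (the zero initialisation is a common trick in adapter-style methods so that the adapter starts as an identity map). Tokens from the frozen video backbone act as queries, and tokens from an additional modality act as keys and values.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class CrossAttentionAdapter:
    """Hypothetical single-head cross-attention adapter with a bottleneck."""

    def __init__(self, dim, bottleneck, rng):
        self.bottleneck = bottleneck
        self.w_q = random_matrix(dim, bottleneck, rng)  # projects video tokens
        self.w_k = random_matrix(dim, bottleneck, rng)  # projects extra-modality tokens
        self.w_v = random_matrix(dim, bottleneck, rng)
        # Assumption: zero-initialised up-projection, so the adapter is an identity at start.
        self.w_up = [[0.0] * dim for _ in range(bottleneck)]

    def __call__(self, video_tokens, extra_tokens):
        q = matmul(video_tokens, self.w_q)
        k = matmul(extra_tokens, self.w_k)
        v = matmul(extra_tokens, self.w_v)
        k_t = [list(col) for col in zip(*k)]
        scale = math.sqrt(self.bottleneck)
        attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
        update = matmul(matmul(attn, v), self.w_up)
        # Residual connection: frozen backbone features plus the adapter's update.
        return [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(video_tokens, update)]

    def num_parameters(self):
        weights = (self.w_q, self.w_k, self.w_v, self.w_up)
        return sum(len(w) * len(w[0]) for w in weights)


rng = random.Random(0)
dim, bottleneck = 8, 2
video_tokens = random_matrix(4, dim, rng, scale=1.0)  # stand-in for frozen backbone output
extra_tokens = random_matrix(6, dim, rng, scale=1.0)  # stand-in for a second modality
adapter = CrossAttentionAdapter(dim, bottleneck, rng)
fused = adapter(video_tokens, extra_tokens)
```

In this sketch only the four small adapter matrices would be trained, while the features in `video_tokens` come from a backbone whose weights never change. Before any training, `fused` equals `video_tokens`, so adding the adapter does not disturb the pretrained model's behaviour.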
Taxonomy
Methods: Adapter
