CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets

Tanay Agrawal; Mohammed Guermal; Michal Balazia; Francois Bremond

arXiv:2501.03332·cs.CV·January 8, 2025

TL;DR

CM3T is a versatile, efficient framework that adapts transformer models for multimodal video classification, requiring minimal retraining and achieving state-of-the-art results across diverse datasets.

Contribution

Introduces CM3T, a model-agnostic plugin architecture with novel adapters for efficient cross-modal learning without extensive retraining.

Findings

01

Achieves comparable or better results than state-of-the-art with fewer trainable parameters.

02

Demonstrates effectiveness across multiple datasets and recording settings.

03

Requires only 12.8% trainable parameters for video processing and 22.3% for multimodal processing.
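
The percentages in the third finding are ratios of parameter counts. A minimal sketch of that arithmetic follows, assuming the ratio is taken against the parameter count of the frozen backbone (an assumption; this page does not state the denominator in full). The function name and the counts are hypothetical and were chosen only so that the result comes out to 12.8%.

```python
def trainable_percentage(trainable_params: int, reference_params: int) -> float:
    """Return trainable_params as a percentage of reference_params."""
    if reference_params <= 0:
        raise ValueError("reference_params must be positive")
    return 100.0 * trainable_params / reference_params


# Hypothetical counts for illustration only; they are not taken from the paper.
backbone_params = 100_000_000  # assumed size of the frozen backbone
adapter_params = 12_800_000    # assumed size of the added trainable adapters

print(f"{trainable_percentage(adapter_params, backbone_params):.1f}%")
```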

Abstract

Challenges in cross-learning include inhomogeneous or even inadequate amounts of training data and a lack of resources for retraining large pretrained models. Inspired by two transfer learning techniques from NLP, adapters and prefix tuning, this paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T, that adapts transformer-based models to new or missing information. We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient because the backbone and other plugins do not need to be finetuned along with these additions. Comparative and ablation studies on three datasets (Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5) show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters compared to the…
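
The general idea of a cross-attention adapter attached to a frozen backbone can be pictured roughly as follows. This is a minimal pure-Python sketch under several assumptions, not the authors' implementation: the class name `CrossAttentionAdapter`, the single attention head, the bottleneck projection, the choice of video tokens as queries, and the zero-initialised up-projection are all hypothetical choices made for illustration. The backbone is represented only by its output tokens, which the adapter reads but never modifies in place.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, used here as a stand-in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class CrossAttentionAdapter:
    """Hypothetical single-head cross-attention adapter with a residual path."""

    def __init__(self, video_dim, other_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        self.w_q = rand_matrix(video_dim, bottleneck_dim, rng)  # queries from video
        self.w_k = rand_matrix(other_dim, bottleneck_dim, rng)  # keys from other modality
        self.w_v = rand_matrix(other_dim, bottleneck_dim, rng)  # values from other modality
        # Assumed design choice: a zero up-projection makes the adapter an
        # identity map at the start, so the frozen backbone output is unchanged.
        self.w_up = [[0.0] * video_dim for _ in range(bottleneck_dim)]
        self.scale = 1.0 / math.sqrt(bottleneck_dim)

    def num_params(self):
        """Count the adapter weights, the only parameters that would be trained."""
        weights = (self.w_q, self.w_k, self.w_v, self.w_up)
        return sum(len(w) * len(w[0]) for w in weights)

    def __call__(self, video_tokens, other_tokens):
        q = matmul(video_tokens, self.w_q)
        k = matmul(other_tokens, self.w_k)
        v = matmul(other_tokens, self.w_v)
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        attn = [softmax([s * self.scale for s in row]) for row in scores]
        update = matmul(matmul(attn, v), self.w_up)
        # Residual connection: backbone tokens plus the cross-modal update.
        return [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(video_tokens, update)]


# Usage with made-up shapes: 5 video tokens of width 4, 3 tokens of width 3
# from a second modality, and a bottleneck of width 2.
rng = random.Random(1)
video = rand_matrix(5, 4, rng, scale=1.0)
audio = rand_matrix(3, 3, rng, scale=1.0)
adapter = CrossAttentionAdapter(video_dim=4, other_dim=3, bottleneck_dim=2)
fused = adapter(video, audio)
```

In this sketch the output keeps the shape of the video tokens, so the adapter could be placed between existing layers without changing what later layers expect. Only the four adapter matrices would receive gradient updates in a real training setup.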

Taxonomy

Methods: Adapter