CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu, Jaehong Yoon, Mohit Bansal

TL;DR
CREMA introduces a modular, efficient framework for video-language reasoning that integrates multiple modalities without extensive retraining, achieving comparable or better results with significantly fewer trainable parameters.
Contribution
The paper presents CREMA, a novel multimodal fusion framework that efficiently incorporates diverse modalities into video reasoning models with minimal parameter updates.
Findings
Matches or exceeds strong baselines on 7 video-language reasoning tasks.
Uses over 90% fewer trainable parameters than strong baselines.
Integrates additional modalities such as audio, thermal heatmaps, and touch maps without extra human annotation.
Abstract
Despite impressive advancements in recent multimodal reasoning approaches, they are still limited in flexibility and efficiency, as these models typically process only a few fixed modality inputs and require updates to numerous parameters. This paper tackles these critical challenges and proposes CREMA, a generalizable, highly efficient, and modular modality-fusion framework that can incorporate any new modality to enhance video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio, thermal heatmap, and touch map) from given videos without extra human annotation by leveraging sensors or existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. It projects diverse modality features to the LLM token embedding space, allowing the model to…
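The modular idea in the abstract (one shared backbone plus a small trainable module per modality, all projecting into the LLM token embedding space) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single frozen linear map as a stand-in for the query transformer and a low-rank adapter as the parameter-efficient module; the abstract specifies neither. The class names, dimensions, and rank are all hypothetical.

```python
import random


def matvec(W, x):
    """Multiply a row-major matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class LowRankAdapter:
    """Hypothetical per-modality module: delta(x) = B @ (A @ x)."""

    def __init__(self, d_in, d_out, rank, rng):
        # A is small random, B is zero, so a new adapter starts as a no-op.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def num_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class ModularProjector:
    """Frozen shared projection plus one trainable adapter per modality."""

    def __init__(self, d_feat, d_llm, rank, seed=0):
        self.d_feat, self.d_llm, self.rank = d_feat, d_llm, rank
        self.rng = random.Random(seed)
        # Stand-in for the frozen backbone; never updated in this sketch.
        self.W_frozen = [
            [self.rng.gauss(0.0, 0.02) for _ in range(d_feat)] for _ in range(d_llm)
        ]
        self.adapters = {}

    def add_modality(self, name):
        # Adding a modality only adds adapter parameters; the backbone is untouched.
        self.adapters[name] = LowRankAdapter(self.d_feat, self.d_llm, self.rank, self.rng)

    def project(self, name, x):
        """Map a modality feature vector into the (assumed) LLM embedding space."""
        base = matvec(self.W_frozen, x)
        delta = self.adapters[name](x)
        return [b + d for b, d in zip(base, delta)]

    def trainable_fraction(self):
        frozen = self.d_feat * self.d_llm
        trainable = sum(a.num_params() for a in self.adapters.values())
        return trainable / (frozen + trainable)


proj = ModularProjector(d_feat=64, d_llm=128, rank=2)
for modality in ("video", "audio", "touch"):
    proj.add_modality(modality)

x = [1.0] * 64
out = proj.project("audio", x)
print(len(out), round(proj.trainable_fraction(), 4))
```

The toy dimensions are chosen for readability, so the printed trainable fraction says nothing about the paper's reported savings. The point is structural: each new modality costs only its own adapter, and the shared weights stay fixed.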
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · COVID-19 diagnosis using AI
