CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Shoubin Yu; Jaehong Yoon; Mohit Bansal

arXiv:2402.05889 · cs.CV · March 21, 2025 · 1 citation

TL;DR

CREMA introduces a modular, efficient framework for video-language reasoning that integrates multiple modalities without extensive retraining, achieving comparable or better results with significantly fewer trainable parameters.

Contribution

The paper presents CREMA, a novel multimodal fusion framework that efficiently incorporates diverse modalities into video reasoning models with minimal parameter updates.

Findings

01 Achieves comparable or better performance than strong multimodal baselines on 7 video-language reasoning tasks.

02 Reduces trainable parameters by over 90% compared to strong baselines.

03 Integrates additional modalities such as audio, thermal, and touch without extra human annotation.

Abstract

Despite impressive advancements in recent multimodal reasoning approaches, they are still limited in flexibility and efficiency, as these models typically process only a few fixed modality inputs and require updates to numerous parameters. This paper tackles these critical challenges and proposes CREMA, a generalizable, highly efficient, and modular modality-fusion framework that can incorporate any new modality to enhance video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio, thermal heatmap, and touch map) from given videos without extra human annotation by leveraging sensors or existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. It projects diverse modality features to the LLM token embedding space, allowing the model to…
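The modular idea in the abstract, one small trainable module per modality feeding a shared projection into the LLM token embedding space, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the parameter-efficient module, so a low-rank adapter is assumed here, and the class name `ModularProjector`, its methods, and all dimensions are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix used as a stand-in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ModularProjector:
    """Hypothetical sketch: a shared frozen projection plus one low-rank
    adapter per modality (an assumption, not the paper's stated design)."""

    def __init__(self, d_feat, d_llm, rank, seed=0):
        self.rng = random.Random(seed)
        self.d_feat, self.d_llm, self.rank = d_feat, d_llm, rank
        # Shared weights: treated as frozen, never counted as trainable.
        self.shared = rand_matrix(d_feat, d_llm, self.rng)
        self.adapters = {}

    def add_modality(self, name):
        """Register a new modality by adding only a small adapter."""
        down = rand_matrix(self.d_feat, self.rank, self.rng)
        up = [[0.0] * self.d_llm for _ in range(self.rank)]  # zero init: no change at start
        self.adapters[name] = (down, up)

    def project(self, name, feats):
        """Map modality features (rows of length d_feat) to d_llm-sized tokens."""
        down, up = self.adapters[name]
        base = matmul(feats, self.shared)
        delta = matmul(matmul(feats, down), up)
        return [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(base, delta)]

    def trainable_params(self):
        return len(self.adapters) * self.rank * (self.d_feat + self.d_llm)

    def frozen_params(self):
        return self.d_feat * self.d_llm


proj = ModularProjector(d_feat=32, d_llm=64, rank=2)
for modality in ("video", "audio", "optical_flow"):
    proj.add_modality(modality)

feats = [[1.0] * 32 for _ in range(4)]  # 4 placeholder feature vectors
tokens = proj.project("audio", feats)
print(len(tokens), len(tokens[0]), proj.trainable_params(), proj.frozen_params())
```

In this toy setup each new modality adds only `rank * (d_feat + d_llm)` trainable weights while the shared projection stays fixed, which is the general shape of the efficiency argument; the actual savings reported in the paper depend on its real architecture and baselines.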

Code & Models

Repositories

Yui010206/CREMA
PyTorch · Official

Videos

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · COVID-19 diagnosis using AI