Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models
Chenzhuang Du, Yue Zhao, Chonghua Liao, Jiacheng You, Jie Fu, Hang Zhao

TL;DR
This paper presents MMLoRA, a novel method that enhances multi-modal learning by fine-tuning large-scale uni-modal models with low-rank adaptations, improving cross-modal integration and performance across diverse datasets.
Contribution
Introduction of MMLoRA, a technique that freezes uni-modal models and adds trainable low-rank matrices for better multi-modal feature learning.
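The idea of freezing a fine-tuned model and training only added low-rank matrices can be sketched for a single linear layer. This is a minimal illustration of the standard LoRA formulation, not the authors' implementation: the class name `LoRALinear`, the `rank` and `alpha` parameters, and the initialization (small random `A`, zero `B`) are assumptions, and the summary does not say which layers of the uni-modal models receive the low-rank matrices.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (hypothetical sketch)."""

    def __init__(self, frozen_weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(frozen_weight), len(frozen_weight[0])
        self.W = frozen_weight  # frozen: never updated during adaptation
        # Trainable factors (assumed init): A is small random, B starts at zero,
        # so the adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                   # output of the frozen layer
        delta = matvec(self.B, matvec(self.A, x))  # low-rank correction B @ (A @ x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Count entries in A and B; W is excluded because it stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
    layer = LoRALinear(W, rank=1)
    print(layer.forward([1.0, 1.0, 2.0]))  # matches the frozen layer while B is zero
    print(layer.num_trainable())
```

Because `B` starts at zero, adaptation begins from the behaviour of the fine-tuned uni-modal model, and only the entries of `A` and `B` would be updated during multi-modal training.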
Findings
MMLoRA improves multi-modal performance across audio-visual, vision-language, and RGB-Optical Flow datasets.
The method enhances cross-modal adaptation without retraining entire models.
Experimental results show significant gains over baseline approaches.
Abstract
This paper investigates how to better leverage large-scale pre-trained uni-modal models to further enhance discriminative multi-modal learning. Even when fine-tuned with only uni-modal data, these models can outperform previous multi-modal models in certain tasks. It's clear that their incorporation into multi-modal learning would significantly improve performance. However, multi-modal learning with these models still suffers from insufficient learning of uni-modal features, which weakens the resulting multi-modal model's generalization ability. While fine-tuning uni-modal models separately and then aggregating their predictions is straightforward, it doesn't allow for adequate adaptation between modalities, also leading to sub-optimal results. To this end, we introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA). By freezing the weights of uni-modal fine-tuned models, adding…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing
