Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models

Chenzhuang Du; Yue Zhao; Chonghua Liao; Jiacheng You; Jie Fu; Hang Zhao

arXiv:2310.05193·cs.CV·October 10, 2023

TL;DR

This paper presents MMLoRA, a method that freezes fine-tuned large-scale uni-modal models and trains added low-rank adaptation matrices to improve adaptation between modalities. The authors report improved multi-modal performance across audio-visual, vision-language, and RGB-Optical Flow datasets.

Contribution

Introduction of MMLoRA, a technique that freezes uni-modal models and adds trainable low-rank matrices for better multi-modal feature learning.
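
The idea of freezing a pre-trained weight and training only a low-rank update can be sketched as follows. This is a minimal illustration of generic low-rank adaptation on a single linear layer, not the authors' implementation: the class name `LoRALinear`, the `rank` and `alpha` parameters, the initialization, and the layer sizes are all assumptions, and the source does not state where MMLoRA places its low-rank matrices or how they are trained.

```python
import numpy as np


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        # Frozen pre-trained weight: copied and marked read-only.
        self.weight = weight.copy()
        self.weight.setflags(write=False)
        # Trainable low-rank factors. B starts at zero, so the layer
        # initially reproduces the frozen model exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        # Only A and B would be updated during adaptation.
        return self.A.size + self.B.size


# Usage with made-up sizes: a 16 -> 8 layer adapted with rank 2.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
layer = LoRALinear(W, rank=2)
x = rng.normal(size=(4, 16))
y = layer.forward(x)
print("frozen parameters:", W.size)
print("trainable parameters:", layer.num_trainable())
```

Because the update is the product of two small matrices, the number of trainable parameters grows with the rank rather than with the full size of the frozen weight.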

Findings

01. MMLoRA improves multi-modal performance across audio-visual, vision-language, and RGB-Optical Flow datasets.

02. The method enhances cross-modal adaptation without retraining entire models.

03. Experimental results show significant gains over baseline approaches.

Abstract

This paper investigates how to better leverage large-scale pre-trained uni-modal models to further enhance discriminative multi-modal learning. Even when fine-tuned with only uni-modal data, these models can outperform previous multi-modal models in certain tasks. It's clear that their incorporation into multi-modal learning would significantly improve performance. However, multi-modal learning with these models still suffers from insufficient learning of uni-modal features, which weakens the resulting multi-modal model's generalization ability. While fine-tuning uni-modal models separately and then aggregating their predictions is straightforward, it doesn't allow for adequate adaptation between modalities, also leading to sub-optimal results. To this end, we introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA). By freezing the weights of uni-modal fine-tuned models, adding…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing