Explore How to Inject Beneficial Noise in MLLMs
Ruishu Zhu, Sida Huang, Ziheng Jiao, Hongyuan Zhang

TL;DR
This paper introduces a novel fine-tuning method for Multimodal Large Language Models that injects beneficial noise to improve cross-modal alignment and performance, surpassing traditional fine-tuning techniques with minimal additional parameters.
Contribution
The paper proposes MuNG, a multimodal noise generator that dynamically analyzes cross-modal relationships to inject task-adaptive noise, enhancing MLLMs without full fine-tuning.
Findings
Outperforms previous fine-tuning methods and even full fine-tuning
Requires only 1-2% additional parameters
Improves cross-modal representation and downstream task performance
Abstract
Multimodal Large Language Models (MLLMs) have played an increasingly important role in multimodal intelligence. However, the existing fine-tuning methods often ignore cross-modal heterogeneity, limiting their full potential. In this work, we propose a novel fine-tuning strategy by injecting beneficial random noise, which outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses…
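The noise-injection idea described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify MuNG's architecture, so everything below is an assumption made for illustration only: the cross-attention between image and text tokens, the Gaussian parameterization with a reparameterized sample, the additive injection into the visual tokens, and all names (`init_generator`, `generate_noise`, `inject`, the weight keys). It is not the authors' implementation.

```python
import numpy as np


def init_generator(d_img, d_txt, d_hidden, rng):
    """Create the small trainable parameter set (hypothetical layout)."""
    scale = 0.02
    return {
        "W_q": rng.normal(0.0, scale, (d_img, d_hidden)),       # image -> query
        "W_k": rng.normal(0.0, scale, (d_txt, d_hidden)),       # text -> key
        "W_v": rng.normal(0.0, scale, (d_txt, d_hidden)),       # text -> value
        "W_mu": rng.normal(0.0, scale, (d_hidden, d_img)),      # context -> noise mean
        "W_logvar": rng.normal(0.0, scale, (d_hidden, d_img)),  # context -> log-variance
    }


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def generate_noise(params, img_tokens, txt_tokens, rng):
    """Sample noise conditioned on an image-text pair (assumed design)."""
    q = img_tokens @ params["W_q"]                   # (n_img, d_hidden)
    k = txt_tokens @ params["W_k"]                   # (n_txt, d_hidden)
    v = txt_tokens @ params["W_v"]                   # (n_txt, d_hidden)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_img, n_txt)
    ctx = attn @ v                                   # text context per image token
    mu = ctx @ params["W_mu"]                        # (n_img, d_img)
    logvar = ctx @ params["W_logvar"]                # (n_img, d_img)
    eps = rng.standard_normal(mu.shape)
    # Reparameterized Gaussian sample: mu + sigma * eps
    return mu + np.exp(0.5 * logvar) * eps


def inject(img_tokens, noise):
    """Add the noise to the visual tokens; the backbone itself stays frozen."""
    return img_tokens + noise


rng = np.random.default_rng(0)
params = init_generator(d_img=8, d_txt=6, d_hidden=4, rng=rng)
img_tokens = rng.standard_normal((5, 8))   # 5 visual tokens from a frozen encoder
txt_tokens = rng.standard_normal((3, 6))   # 3 text tokens
noisy_tokens = inject(img_tokens, generate_noise(params, img_tokens, txt_tokens, rng))
print(noisy_tokens.shape)  # (5, 8)
```

In a setup like this, only the generator's weight matrices would be trained while the MLLM's own weights stay fixed, which is one way the added parameter count could remain small.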
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
