MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search

Minkyoung Cho; Insu Jang; Shuowei Jin; Zesen Zhao; Adityan Jothi; Ethem F. Can; Min-Hung Chen; Z. Morley Mao

arXiv:2603.00720·cs.LG·March 3, 2026

Open Access · 3 Reviews

TL;DR

MARS introduces an automated, adaptive method for optimizing the fine-tuning of multimodal large language models by balancing training dynamics across modalities, leading to improved performance and efficiency.

Contribution

The paper proposes MARS, a novel framework that uses dual scaling laws to automatically discover optimal rank pairs for balanced multimodal model fine-tuning.

Findings

01. MARS outperforms baseline methods in multimodal model fine-tuning.

02. The dual scaling laws effectively predict convergence time and task performance.

03. MARS provides a robust, automated strategy for optimizing multimodal learning.

Abstract

Fine-tuning Multimodal Large Language Models (MLLMs) with parameter-efficient methods like Low-Rank Adaptation (LoRA) is crucial for task adaptation. However, imbalanced training dynamics across modalities often lead to suboptimal accuracy due to negative interference, a challenge typically addressed with inefficient heuristic methods such as manually tuning separate learning rates. To overcome this, we introduce MARS (Multimodal Adaptive Rank Search), an approach to discover optimal rank pairs that balance training dynamics while maximizing performance. Our key innovation, a proposed framework of dual scaling laws, enables this search: one law models module-specific convergence time to prune the search space to candidates with aligned dynamics, while the other predicts final task performance to select the optimal pair from the pruned set. By re-purposing the LoRA rank as a controller…
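The two-stage search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the power-law forms, every coefficient, the candidate rank list, the `tolerance` threshold, and the function names `convergence_time`, `predicted_loss`, and `search` are all assumptions made for the example.

```python
from itertools import product

RANKS = [8, 16, 32, 64]  # hypothetical candidate ranks for each module


def convergence_time(rank, coef, exponent):
    # Assumed power law T(r) = coef * r**exponent; the paper's actual
    # functional form and fitted constants are not reproduced here.
    return coef * rank ** exponent


def predicted_loss(r_vision, r_language):
    # Assumed additive power law for final task loss (illustrative only).
    return 1.0 + 2.0 * r_vision ** -0.5 + 3.0 * r_language ** -0.5


def search(ranks=RANKS, tolerance=0.25):
    # Stage 1 (prune): keep rank pairs whose predicted convergence
    # times differ by at most `tolerance` in relative terms.
    aligned = []
    for r_vision, r_language in product(ranks, ranks):
        t_vision = convergence_time(r_vision, coef=400.0, exponent=-0.5)
        t_language = convergence_time(r_language, coef=200.0, exponent=-0.5)
        gap = abs(t_vision - t_language) / max(t_vision, t_language)
        if gap <= tolerance:
            aligned.append((r_vision, r_language))
    # Stage 2 (predict): among aligned pairs, choose the one with the
    # lowest predicted task loss.
    best = min(aligned, key=lambda pair: predicted_loss(*pair))
    return aligned, best


if __name__ == "__main__":
    aligned_pairs, best_pair = search()
    print("aligned candidates:", aligned_pairs)
    print("selected pair:", best_pair)
```

With these made-up coefficients, only a small subset of the 16 grid points survives pruning, so the performance predictor is evaluated on far fewer candidates than a full grid search would train.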

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The idea of using dual scaling laws to guide rank selection for multimodal fine-tuning is innovative and yields a clear reduction in the computational burden of hyperparameter search.
2. The proposed dual scaling laws are empirically calibrated and validated with extensive experiments.
3. Experimental results against multiple baselines on two standard multimodal benchmarks demonstrate consistent improvements.

Weaknesses

1. The related work section does not address several directly pertinent recent efforts on adaptive rank selection in LoRA and multimodal search. The comparison would be stronger with more relevant baselines and benchmarks.
2. The scaling laws are empirically fitted without strong theoretical justification, which may limit interpretability and generalizability.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* Replaces a brute-force grid search with an efficient "prune-then-predict" strategy and a closed-form solution (Eq. 3) for balancing ranks.
* The log-log plots (Fig. 2b, 6) show near-parallel lines, supporting the formula's assumption that rank and data size effects are separable.
* Achieves better perplexity and accuracy than baselines on LLaVA-OV-7B and other models.
* Cuts search and training time by over 11.5x versus a simple 4x4 grid search.
* Still shows gains even when tested on "fro
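The separability check the reviewer refers to can be illustrated with a small sketch. It assumes a separable power law of the form T = c · r^α · D^β, which is a hypothetical form chosen for the example (the constants and the helper names `loglog_slope` and `synthetic_time` are likewise invented). Under that assumption, the log-log slope of convergence time against rank should be the same at every data size, which is what near-parallel lines indicate.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) regressed on log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    den = sum((a - mean_x) ** 2 for a in log_x)
    return num / den


def synthetic_time(rank, data_size, c=5.0, alpha=-0.4, beta=0.6):
    # Synthetic measurements from an assumed separable law
    # T = c * r**alpha * D**beta; all constants are made up.
    return c * rank ** alpha * data_size ** beta


ranks = [8, 16, 32, 64]
data_sizes = [1_000, 4_000, 16_000]

# One slope per data size; parallel lines mean these slopes agree.
slopes = {
    d: loglog_slope(ranks, [synthetic_time(r, d) for r in ranks])
    for d in data_sizes
}

if __name__ == "__main__":
    for d, s in slopes.items():
        print(f"data size {d}: slope of log T vs log r = {s:.3f}")
```

On real measurements the slopes would only agree approximately; a large spread across data sizes would suggest that rank and data size interact and the separable form does not hold.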

Weaknesses

* Results focus on "search time" savings, but higher ranks (which MARS might pick) cost more per step in FLOPs and memory. The total end-to-end wall-clock and energy cost is not compared.
* Why is LoRA applied only to q/k/v? The projector is tuned but is not part of the rank search. Ranks are per-module, not per-layer. Rounding the continuous rank from Eq. 3 to the nearest discrete one seems crude; a local sweep wasn't tested.
* The scaling exponents are fit on "from-scratch" models but then used for pre-trained MLL
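The difference between plain rounding and the local sweep the reviewer suggests can be sketched as below. Nothing here comes from the paper: the candidate grid, the continuous rank value, the `score` callback, and the helper names `nearest_rank` and `local_sweep` are hypothetical, and "nearest" is taken to mean smallest absolute difference, which is itself an assumption.

```python
def nearest_rank(r_star, candidates):
    # Round a continuous rank to the closest discrete candidate
    # (closest by absolute difference; an assumption for this sketch).
    return min(candidates, key=lambda r: abs(r - r_star))


def local_sweep(r_star, candidates, score, width=1):
    # Evaluate the nearest candidate plus `width` neighbours on each
    # side, and keep the one with the lowest score (e.g. validation loss).
    ordered = sorted(candidates)
    center = ordered.index(nearest_rank(r_star, ordered))
    window = ordered[max(0, center - width): center + width + 1]
    return min(window, key=score)


if __name__ == "__main__":
    grid = [8, 16, 32, 64]          # hypothetical discrete ranks
    r_continuous = 23.0             # hypothetical closed-form output

    def toy_score(rank):
        # Stand-in for a short validation run; purely illustrative.
        return abs(rank - 40)

    print("rounded:", nearest_rank(r_continuous, grid))
    print("local sweep:", local_sweep(r_continuous, grid, toy_score))
```

A sweep of this kind costs a few extra short runs per module, so whether it is worthwhile depends on how flat the performance curve is around the rounded rank.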

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The proposed method is experimentally validated, showing consistent improvements in task performance across different LLMs. It clearly outperforms naive and heuristic approaches. The paper is mostly easy to follow.

Weaknesses

The scaling laws are empirically derived from a limited set of tasks (ScienceQA, LLaVA Bench); it remains unclear how well they generalize to other multimodal tasks or broader modality combinations.

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Speech Recognition and Synthesis