TL;DR
DiM3 is a training-free method that adds multilingual capability to an existing multimodal model by selectively merging heterogeneous updates in the shared language model backbone, improving performance across 57 languages.
Contribution
The paper proposes Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), an approach that integrates multilingual and multimodal updates without retraining and outperforms existing merging baselines.
Findings
DiM3 outperforms existing merging baselines on multilingual benchmarks.
DiM3 significantly improves multilingual performance over the original multimodal models.
DiM3 maintains competitive multimodal abilities while enhancing multilingual capabilities.
Abstract
Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on…
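The idea of composing two residual updates dimension by dimension can be illustrated with a minimal sketch. The abstract does not state DiM3's actual selection rule, so everything below is an assumption made for illustration: the function name `merge_updates`, the `dominance_ratio` parameter, and the specific sign-and-magnitude rule are hypothetical. The sketch treats a flat list of language backbone parameters only; the vision encoder and multimodal projector would be left untouched, as the abstract describes.

```python
def merge_updates(base, multilingual, multimodal, dominance_ratio=2.0):
    """Hypothetical per-dimension merge of two residual updates.

    This is an illustrative rule, NOT the rule used by DiM3.
    All three arguments are flat lists of floats of equal length.
    """
    merged = []
    for b, ml, mm in zip(base, multilingual, multimodal):
        d_ml = ml - b  # residual update from the multilingual model
        d_mm = mm - b  # residual update from the multimodal model

        if d_ml * d_mm >= 0:
            # Directions agree (or one update is zero):
            # keep the update with the larger magnitude.
            delta = d_ml if abs(d_ml) >= abs(d_mm) else d_mm
        else:
            # Directions conflict: keep the multimodal update unless the
            # multilingual one is clearly larger (assumed threshold).
            delta = d_ml if abs(d_ml) > dominance_ratio * abs(d_mm) else d_mm

        merged.append(b + delta)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
multilingual = [0.5, 0.1, -0.4, 0.2]
multimodal = [0.2, 0.3, 0.1, -0.6]
print(merge_updates(base, multilingual, multimodal))
```

In this toy rule, conflicting dimensions default to the multimodal update so that existing multimodal behavior is favored; a real method would choose its criteria from the observed structure of the two updates.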
