DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Zijing Wang; Mingyang Wang; Ercong Nie; Yongkang Liu; Shi Feng; Mengjie Zhao; Daling Wang; Xiaocui Yang; Hinrich Schütze

arXiv:2605.12960 · cs.CL · May 21, 2026

TL;DR

DiM3 is a training-free method that injects multilingual capability into an existing multimodal model by selectively merging heterogeneous updates, improving performance across 57 languages.

Contribution

The paper proposes Direction- and Magnitude-aware merging (DiM3), an approach that integrates multilingual and multimodal updates without retraining and outperforms existing merging baselines.

Findings

01. DiM3 outperforms existing merging baselines on multilingual benchmarks.

02. DiM3 significantly improves multilingual performance over the original models.

03. DiM3 maintains competitive multimodal abilities while enhancing multilingual capabilities.

Abstract

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on…
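The idea of composing two residual updates per parameter dimension can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the abstract excerpt above does not state the actual DiM3 selection rule, so the rule in `merge_element` (add the updates when their signs agree, keep only the larger-magnitude update when they conflict) is a hypothetical stand-in, and the function names, parameter names, and toy weights are invented. The sketch only mirrors what the abstract does state: updates are residuals relative to a shared language model backbone, and parameters outside that backbone (vision encoder, projector) are left untouched.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def residual(finetuned: Params, base: Params) -> Params:
    """Residual update: fine-tuned weights minus base weights, per backbone tensor."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge_element(d_mm: float, d_ml: float) -> float:
    """Hypothetical per-dimension rule (an assumption, NOT the paper's rule)."""
    if d_mm * d_ml >= 0:
        # Directions agree (or one update is zero): compose both updates.
        return d_mm + d_ml
    # Directions conflict: keep only the update with the larger magnitude.
    return d_mm if abs(d_mm) >= abs(d_ml) else d_ml


def merge(base: Params, multimodal: Params, multilingual: Params) -> Params:
    """Merge backbone tensors; copy every other multimodal tensor unchanged."""
    d_mm = residual(multimodal, base)
    d_ml = residual(multilingual, base)
    # Start from the multimodal model so non-backbone tensors are preserved.
    merged = {name: list(values) for name, values in multimodal.items()}
    for name in base:
        merged[name] = [
            b + merge_element(a, c)
            for b, a, c in zip(base[name], d_mm[name], d_ml[name])
        ]
    return merged


# Toy weights (invented): "lm.w" is the shared backbone, "vision.w" is not.
base = {"lm.w": [1.0, 1.0, 1.0]}
multimodal = {"lm.w": [1.5, 0.5, 1.0], "vision.w": [3.0]}
multilingual = {"lm.w": [1.25, 2.0, 0.75]}

merged = merge(base, multimodal, multilingual)
print(merged)
```

In this toy case the first dimension has agreeing updates that are summed, the second has conflicting updates where the larger multilingual update wins, and the third has a multimodal update of zero, so the multilingual update passes through. The `vision.w` tensor is copied from the multimodal model without modification.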

Code & Models

Repositories

wzj1718/DiM3
github
