TL;DR
This paper shows that modular fine-tuning and model merging methods, especially layer-swapping, improve cross-lingual transfer in large language models for low-resource languages. They do so by exploiting the fact that the parameter subsets that matter most for math and for language are non-overlapping.
Contribution
It develops and analyzes modular frameworks that improve cross-lingual transfer by assigning math and language improvements to different parts of the model, either by freezing parameters during fine-tuning or by merging separately fine-tuned models post hoc.
Findings
Layer-swapping via model merging is highly effective.
Modular approaches outperform baseline fine-tuning methods.
Reverting less useful updates can outperform freezing from the start.
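The last finding contrasts two ways of restricting which parameters change. A minimal sketch of the "revert" variant follows, assuming a model is represented as a plain dictionary of parameter lists. The names `revert_outside` and `is_math_layer`, the parameter names, and the choice of which layer counts as a math layer are all hypothetical and are not taken from the paper.

```python
# Sketch: fine-tune everything, then revert updates outside a chosen subset.
# All names and the layer allocation below are illustrative assumptions.

def revert_outside(base, tuned, keep):
    """Return a copy of `tuned` in which every parameter NOT selected by
    `keep(name)` is reset to its value in `base`."""
    return {
        name: list(tuned[name]) if keep(name) else list(base[name])
        for name in base
    }

# Toy "models": parameter name -> list of weights.
base = {
    "layers.0.w": [0.0, 0.0],
    "layers.1.w": [0.0, 0.0],
    "layers.2.w": [0.0, 0.0],
}
tuned = {
    "layers.0.w": [0.1, 0.2],
    "layers.1.w": [0.3, 0.4],
    "layers.2.w": [0.5, 0.6],
}

def is_math_layer(name):
    # Hypothetical allocation: only the middle layer keeps its math updates.
    return name.startswith("layers.1.")

reverted = revert_outside(base, tuned, is_math_layer)
```

Freezing from the start would instead exclude the unselected parameters from the optimizer, so they never move. Reverting lets all parameters adapt during training and discards the unwanted updates afterwards.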
Abstract
Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three…
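Post hoc merging by layer-swapping can be illustrated with a short sketch. It assumes two experts fine-tuned from the same base model and stored as dictionaries of parameter lists. The function names, the `layers.<i>.` naming scheme, and the choice of which layers come from the language expert are hypothetical, not the paper's configuration.

```python
import re

# Sketch of layer-swapping between two experts with identical architecture.
# Which layers belong to "language" is an illustrative assumption.

def layer_index(name):
    """Extract the layer number from names like 'layers.3.w', else None."""
    match = re.match(r"layers\.(\d+)\.", name)
    return int(match.group(1)) if match else None

def layer_swap(math_expert, lang_expert, lang_layers):
    """Build a merged model: layers listed in `lang_layers` are copied from
    the language expert, everything else from the math expert."""
    merged = {}
    for name in math_expert:
        source = lang_expert if layer_index(name) in lang_layers else math_expert
        merged[name] = list(source[name])
    return merged

math_expert = {f"layers.{i}.w": [1.0] for i in range(4)}
math_expert["embed.w"] = [1.0]
lang_expert = {f"layers.{i}.w": [2.0] for i in range(4)}
lang_expert["embed.w"] = [2.0]

# Hypothetical allocation: first and last layers come from the language expert.
merged = layer_swap(math_expert, lang_expert, lang_layers={0, 3})
```

In this toy setup the merged model takes layers 0 and 3 from the language expert and the remaining parameters from the math expert. In practice the same loop would run over real checkpoint tensors rather than Python lists.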
Taxonomy
Methods: High-Order Consensuses
