Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

Youngjoon Jang; Junyoung Son; Taemin Lee; Seongtae Hong; Hyeonseok Moon; Seungyoon Lee; Andrew Matteson; Heuiseok Lim

arXiv:2507.08480·cs.IR·May 20, 2026

TL;DR

This study investigates how the composition of training data affects Korean-English cross-lingual retrieval and shows that model merging can balance performance trade-offs between CLIR and mono-lingual IR.

Contribution

The paper systematically analyzes how the language composition of training data affects retrieval performance and introduces model merging as a strategy for balancing cross-lingual and mono-lingual IR.

Findings

01 Language composition significantly influences IR performance.

02 CLIR improves with specific language pairs, while mono-lingual IR declines.

03 Model merging mitigates performance trade-offs.

Abstract

With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To systematically investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings…
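
The abstract does not say which merging variant is used, so the following is only a minimal sketch of one common form of model merging: linear interpolation of the parameters of two checkpoints that share an architecture. The function name `merge_linear`, the weight `alpha`, and the toy checkpoints are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# A checkpoint is modeled here as a mapping from parameter name to a flat
# list of floats. Real checkpoints hold tensors; this is a toy stand-in.
StateDict = Dict[str, List[float]]


def merge_linear(a: StateDict, b: StateDict, alpha: float = 0.5) -> StateDict:
    """Return alpha * a + (1 - alpha) * b, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if a.keys() != b.keys():
        raise ValueError("checkpoints must share parameter names")
    merged: StateDict = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])
        ]
    return merged


# Hypothetical checkpoints: one tuned for CLIR, one for mono-lingual IR.
clir_model = {"encoder.weight": [1.0, 2.0], "encoder.bias": [0.0]}
mono_model = {"encoder.weight": [3.0, 4.0], "encoder.bias": [1.0]}

merged = merge_linear(clir_model, mono_model, alpha=0.5)
print(merged)
```

In this sketch, `alpha` controls how far the merged model leans toward the first checkpoint; sweeping it on a validation set would be one way to look for a setting that keeps CLIR quality without giving up mono-lingual IR.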
