MIDB: Multilingual Instruction Data Booster for Enhancing Cultural Equality in Multilingual Instruction Synthesis

Yilun Liu; Chunguang Zhao; Xinhua Yang; Hongyong Zeng; Shimin Tao; Weibin Meng; Minggui He; Yan Yu; Hongxia Ma; Li Zhang; Daimeng Wei; Boxing Chen

arXiv:2505.17671 · cs.CL · November 12, 2025

TL;DR

This paper introduces MIDB, a data booster trained on expert revisions to improve the quality and cultural localization of multilingual instruction data, thereby enhancing the instruction-following and cultural understanding of multilingual LLMs.

Contribution

MIDB is a novel multilingual data boosting method that automatically improves synthesized instruction data quality across 16 languages, addressing content errors and cultural localization issues.

Findings

01. Enhanced instruction data quality across 16 languages.

02. Significant improvement in multilingual LLMs' instruction-following abilities.

03. Better cultural understanding in multilingual models.

Abstract

Despite doubts about data quality, instruction synthesis has been widely applied to instruction tuning (IT) of LLMs as an economical and rapid alternative. Recent endeavors focus on improving data quality for synthesized instruction pairs in English and have facilitated IT of English-centric LLMs. However, data quality issues in multilingual synthesized instruction pairs are even more severe, since the common synthesizing practice is to translate English synthesized data into other languages using machine translation (MT). Besides the known content errors in the English synthesized data, multilingual synthesized instruction data are further exposed to defects introduced by MT and suffer from insufficient localization to the target languages, leading to cultural inequality in trained LLMs. In this paper, we propose MIDB, a Multilingual Instruction Data Booster to automatically address the…

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Machine Learning and Data Classification
