Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets
Sathish Reddy Indurthi, Wenxuan Zhou, Shamil Chollampatt, Ravi Agrawal, Kaiqiang Song, Lingxiao Zhao, Chenguang Zhu

TL;DR
This paper introduces a novel multilingual instruction fine-tuning dataset creation method that maintains linguistic naturalness and diversity, significantly improving LLM performance across multiple languages.
Contribution
The paper presents a new approach that leverages English-focused LLMs, monolingual corpora, and a scoring function to generate high-quality, diverse multilingual IFT datasets.
Findings
LLMs fine-tuned on the new datasets outperform those fine-tuned on translation-based and template-based datasets.
Improvements of 17.57% and 15.23% on the multilingual summarization task over translation-based and template-based datasets, respectively.
Enhanced language understanding in non-English contexts.
Abstract
Advancements in Large Language Models (LLMs) have significantly enhanced instruction-following capabilities. However, most Instruction Fine-Tuning (IFT) datasets are predominantly in English, limiting model performance in other languages. Traditional methods for creating multilingual IFT datasets, such as translating existing English IFT datasets or converting existing NLP datasets into IFT datasets by templating, struggle to capture linguistic nuances and ensure prompt (instruction) diversity. To address this issue, we propose a novel method for collecting multilingual IFT datasets that preserves linguistic naturalness and ensures prompt diversity. This approach leverages English-focused LLMs, monolingual corpora, and a scoring function to create high-quality, diversified IFT datasets in multiple languages. Experiments demonstrate that LLMs finetuned using these IFT datasets show…
Taxonomy
Topics: Speech and dialogue systems · Subtitles and Audiovisual Media · EFL/ESL Teaching and Learning
