M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models

Rishabh Maheshwary, Vikas Yadav, Hoang Nguyen, Khyati Mahajan, and Sathwik Tejaswi Madhusudhan

arXiv:2406.16783 · cs.CL · March 5, 2025

TL;DR

This paper introduces M2Lingual, a fully synthetic, multilingual, multi-turn instruction finetuning dataset generated with the proposed Evol taxonomy. LLMs of varying sizes trained on it show improved performance across a diverse set of languages.

Contribution

The paper presents the first fully synthetic, multilingual, multi-turn instruction dataset with an Evol taxonomy-guided generation process, covering 70 languages and 17 NLP tasks.

Findings

01 Enhanced LLM performance across multiple languages

02 Successful creation of a large, diverse synthetic dataset

03 Demonstrated effectiveness of Evol-guided instruction finetuning

Abstract

Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. While many effective IFT datasets have been introduced recently, they predominantly focus on high-resource languages like English. To better align LLMs across a broad spectrum of languages and tasks, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual. It is constructed by first selecting a diverse set of seed examples and then utilizing the proposed Evol taxonomy to convert these seeds into complex and challenging multi-turn instructions. We demonstrate the effectiveness of M2Lingual by training LLMs of varying sizes and showcasing the enhanced performance across a diverse set of languages. We contribute the 2-step Evol taxonomy with the guided generation code: https://github.com/ServiceNow/M2Lingual,…
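
The two-stage construction described in the abstract (seed selection, then Evol-guided conversion into multi-turn instructions) can be illustrated with a minimal sketch. This is not the authors' generation code. The Evol labels, the prompt wording, the `build_conversation` helper, and the `echo_generator` stand-in for an LLM call are all hypothetical and chosen only for illustration. The sketch also assumes one plausible reading of "2-step": the seed is first evolved into a harder instruction, and follow-up turns are then generated from it.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical placeholder labels. The paper's actual Evol taxonomy
# is not reproduced here.
INSTRUCTION_EVOLS = ["add_constraint", "increase_reasoning_depth"]
FOLLOW_UP_EVOLS = ["ask_clarification", "request_refinement"]


@dataclass
class Conversation:
    """A multi-turn instruction sequence in a single language."""
    language: str
    turns: List[str] = field(default_factory=list)


def build_conversation(
    seed: str,
    language: str,
    generate: Callable[[str], str],
    n_follow_ups: int = 2,
) -> Conversation:
    """Evolve one seed into a multi-turn conversation (illustrative only)."""
    # Step 1 (assumed): rewrite the seed as a more complex instruction.
    evolved = generate(f"[{INSTRUCTION_EVOLS[0]}] Rewrite in {language}: {seed}")
    conversation = Conversation(language=language, turns=[evolved])

    # Step 2 (assumed): append follow-up turns, cycling through the
    # follow-up Evol labels and conditioning on the previous turn.
    for i in range(n_follow_ups):
        evol = FOLLOW_UP_EVOLS[i % len(FOLLOW_UP_EVOLS)]
        follow_up = generate(f"[{evol}] Follow up on: {conversation.turns[-1]}")
        conversation.turns.append(follow_up)
    return conversation


def echo_generator(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    return f"<generated for: {prompt[:40]}>"


if __name__ == "__main__":
    conv = build_conversation("Summarize this article.", "Hindi", echo_generator)
    for turn_index, turn in enumerate(conv.turns, start=1):
        print(turn_index, turn)
```

Passing the generator in as a callable keeps the sketch independent of any particular model API. A real pipeline would replace `echo_generator` with an actual LLM call and use the taxonomy published in the linked repository.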

Code & Models

Datasets

ServiceNow-AI/M2Lingual
dataset · 223 downloads

Videos

M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training · ALIGN · Focus