LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback

Wen Lai; Mohsen Mesgar; Alexander Fraser

arXiv:2406.01771·cs.CL·June 5, 2024·1 cites

TL;DR

This paper introduces xLLMs-100 (xLLaMA-100 and xBLOOM-100), multilingual large language models that scale LLaMA and BLOOM to 100 languages through multilingual instruction tuning and alignment with cross-lingual human feedback. The models are reported to outperform existing models on five benchmarks.

Contribution

The paper develops xLLMs-100 by scaling LLaMA and BLOOM to 100 languages, using two newly constructed datasets and the corresponding training methods.

Findings

01. xLLMs-100 outperforms existing models on five benchmarks

02. Supports 100 languages, including low-resource ones

03. Achieves state-of-the-art multilingual understanding and generation

Abstract

To democratize large language models (LLMs) to most natural languages, it is imperative to make these models capable of understanding and generating texts in many languages, in particular low-resource ones. While recent multilingual LLMs demonstrate remarkable performance in such capabilities, these LLMs still support a limited number of human languages due to the lack of training data for low-resource languages. Moreover, these LLMs are not yet aligned with human preference for downstream tasks, which is crucial for the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100 languages. To do so, we construct two datasets: a multilingual instruction dataset including 100 languages, which represents the largest language coverage to date, and a cross-lingual human…

Taxonomy

Topics: Natural Language Processing Techniques · Library Science and Information Systems · Translation Studies and Practices

Methods: Direct Preference Optimization · ALIGN · BLOOM · LLaMA