Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study

Rakesh Paul; Anusha Kamath; Kanishk Singla; Raviraj Joshi; Utkarsh Vaidya; Sanjay Singh Chauhan; Niranjan Wartikar

arXiv:2507.14304 · cs.CL · October 16, 2025

TL;DR

This paper systematically studies LLM-based selective translation as a way to align LLMs to low-resource languages, demonstrating its advantages over standard translation methods and highlighting the importance of filtering noisy outputs.

Contribution

It introduces and evaluates LLM-based selective translation for low-resource language alignment, showing its advantages over vanilla translation techniques.

Findings

01. Selective translation improves alignment quality in low-resource languages.
02. Filtering noisy outputs enhances translation effectiveness.
03. Mixing translated and original data benefits model alignment.

Abstract

Multilingual large language models (LLMs) often demonstrate a performance gap between English and non-English languages, particularly in low-resource settings. Aligning these models to low-resource languages is essential yet challenging due to limited high-quality data. While English alignment datasets are readily available, curating equivalent data in other languages is expensive and time-consuming. A common workaround is to translate existing English alignment data; however, standard translation techniques often fail to preserve critical elements such as code, mathematical expressions, and structured formats like JSON. In this work, we investigate LLM-based selective translation, a technique that selectively translates only the translatable parts of a text while preserving non-translatable content and sentence structure. We conduct a systematic study to explore key questions around…
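To make the idea of selective translation concrete, here is a minimal sketch. It is not the paper's method: the abstract describes an LLM performing the selective translation, whereas this sketch swaps in a rule-based approximation that masks non-translatable spans with placeholders, translates the rest, and restores the spans afterward. The `PROTECTED` pattern, the `<<KEEP_n>>` placeholder format, and the function names are all hypothetical, and `translate` stands in for whatever translation backend is available. The placeholder check at the end is one simple, assumed way to flag a noisy output for filtering.

```python
import re
from typing import Callable, List, Optional, Tuple

# Spans treated as non-translatable in this sketch (an assumption, not the
# paper's definition): fenced code blocks, inline code, and inline math.
PROTECTED = re.compile(
    r"```.*?```"      # fenced code block (may span lines)
    r"|`[^`\n]+`"     # inline code
    r"|\$[^$\n]+\$",  # inline LaTeX math
    re.DOTALL,
)

def mask(text: str) -> Tuple[str, List[str]]:
    """Replace each protected span with a numbered placeholder."""
    spans: List[str] = []

    def stash(match: "re.Match") -> str:
        spans.append(match.group(0))
        return f"<<KEEP_{len(spans) - 1}>>"

    return PROTECTED.sub(stash, text), spans

def placeholders_intact(text: str, count: int) -> bool:
    """True if every placeholder survived translation exactly once."""
    return all(text.count(f"<<KEEP_{i}>>") == 1 for i in range(count))

def unmask(text: str, spans: List[str]) -> str:
    """Put the original protected spans back in place of the placeholders."""
    return re.sub(r"<<KEEP_(\d+)>>", lambda m: spans[int(m.group(1))], text)

def selective_translate(
    text: str, translate: Callable[[str], str]
) -> Optional[str]:
    """Translate only the unprotected parts of `text`.

    Returns None when the translator dropped or duplicated a placeholder,
    signalling that the sample should be filtered out as noisy.
    """
    masked, spans = mask(text)
    translated = translate(masked)
    if not placeholders_intact(translated, len(spans)):
        return None
    return unmask(translated, spans)
```

For example, calling `selective_translate("Run `pip install x` to compute $a+b$.", str.upper)` with `str.upper` as a stand-in translator changes the surrounding prose while leaving the inline code and the math expression untouched. A translator that loses a placeholder makes the function return `None`, so the sample can be discarded.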

Code & Models

Datasets

fedric95/T2TSyntheticSafetyBench-Multilingual
dataset · 2 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods