ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Negar Foroutan; Jakhongir Saydaliev; Ye Eun Kim; Antoine Bosselut

arXiv:2506.15304·cs.CL·March 11, 2026

TL;DR

This paper introduces ConLID, a supervised contrastive learning method that improves language identification (LID) accuracy for low-resource languages by learning domain-invariant features, countering the imbalance and domain bias in their training data.

Contribution

The paper proposes a novel supervised contrastive learning approach specifically designed to improve low-resource language identification performance.

Findings

01. Improves LID accuracy for low-resource languages by 3.2 percentage points on out-of-domain data.
02. Maintains performance on high-resource languages.
03. Addresses data imbalance and bias in multilingual LID.

Abstract

Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining its performance for the high-resource languages.
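
The abstract does not spell out the loss, so the following is only a minimal sketch of the generic supervised contrastive (SupCon) objective, not necessarily the exact formulation used in ConLID. It assumes each text has already been mapped to an embedding vector and labeled with its language. Texts sharing a language label are treated as positives, which is one plausible way to pull together same-language samples from different domains. The function name `supcon_loss`, the helper `l2_normalize`, and the default temperature are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch.

    Assumption: `embeddings` is a list of equal-length float lists and
    `labels` holds one language label per embedding. This is a sketch of
    the standard SupCon objective, not the paper's confirmed loss.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity between anchor i and every sample.
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n)]
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Log of the softmax denominator, computed stably (log-sum-exp).
        m = max(sims[j] for j in others)
        log_denom = m + math.log(sum(math.exp(sims[j] - m) for j in others))
        # Average negative log-probability of picking each positive.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

Calling `supcon_loss` on a batch whose same-language embeddings are close together should return a smaller value than on a batch where same-language embeddings are far apart, which is the behavior a contrastive objective is meant to reward during training.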

Code & Models

Models

epfl-nlp/ConLID · Hugging Face model · 6 downloads

Videos

ConLID: Supervised Contrastive Learning for Low-Resource Language Identification (Underline)

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Speech Recognition and Synthesis