ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut

TL;DR
This paper introduces ConLID, a supervised contrastive learning method that improves language identification for low-resource languages on out-of-domain data by learning domain-invariant representations, mitigating the imbalance and bias that arise when training data comes from a single domain.
Contribution
The paper proposes a novel supervised contrastive learning approach specifically designed to improve low-resource language identification performance.
Findings
Improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points.
Maintains high-resource language performance.
Addresses data imbalance and bias in multilingual LID.
Abstract
Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining its performance for the high-resource languages.
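The abstract does not spell out the training objective, so the following is only a minimal sketch of the generic supervised contrastive loss that SCL approaches typically build on, not the authors' implementation. The function names (`l2_normalize`, `dot`, `supcon_loss`), the temperature value, and the toy embeddings and language labels are all hypothetical and chosen for illustration. The idea shown: sentences sharing a language label are pulled together in embedding space regardless of their source domain, while sentences with different labels are pushed apart.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, assumed form).

    For each anchor, every other sample with the same label is a positive;
    all other samples in the batch form the contrast set. The loss is
    averaged over anchors that have at least one positive.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log-sum-exp over the contrast set.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy batch: four 2-d sentence embeddings forming two tight clusters.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that agree with the clusters (same language, nearby embeddings).
aligned_labels = ["uzb", "uzb", "kaz", "kaz"]
# Labels that cut across the clusters (same language, distant embeddings).
crossed_labels = ["uzb", "kaz", "uzb", "kaz"]

loss_aligned = supcon_loss(toy_embeddings, aligned_labels)
loss_crossed = supcon_loss(toy_embeddings, crossed_labels)
print(f"aligned: {loss_aligned:.4f}  crossed: {loss_crossed:.4f}")
```

In this toy setup the loss is lower when same-language embeddings already sit close together, which is the behavior a contrastive objective rewards. How the paper selects positives and negatives, and how it combines this term with a classification loss, is not stated in the abstract and is not assumed here.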
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Speech Recognition and Synthesis
