PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection

Ali Lotfi Rezaabad; Bikram Khanal; Shashwat Chaurasia; Lu Zeng; Dezhi Hong; Hossein Bashashati; Thomas Butler; Megan Ganji

arXiv:2512.08143·cs.LG·December 11, 2025

TL;DR

PolyLingua is a lightweight, contrastive learning-based Transformer model that significantly improves cross-domain language detection accuracy, especially for closely related languages, while being efficient enough for low-resource settings.

Contribution

It introduces a novel margin-based inter-class Transformer with a two-level contrastive learning framework for robust language detection across challenging datasets.

Findings

01. Achieves over 99% F1 on multilingual datasets
02. Outperforms existing models like Sonnet 3.5 in accuracy
03. Uses 10x fewer parameters for efficiency

Abstract

Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet existing language identification tools struggle with key cases -- such as music requests where the song title and user language differ. Open-source tools like LangDetect and FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and…
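
To make the idea of a two-level contrastive objective more concrete, here is a minimal sketch in plain Python. It is not the authors' formulation, which the abstract does not spell out. The function names, the supervised-contrastive form of the instance-level term, the centroid-based class-level term, and the per-pair `margins` dictionary (a stand-in for "adaptive margins") are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)  # guard against zero vectors

def instance_level_loss(embs, labels, temperature=0.1):
    """Assumed instance-level term: a supervised-contrastive-style loss that
    rewards an anchor for being closer to same-label samples than to others."""
    total, count = 0.0, 0
    n = len(embs)
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        logits = {j: cosine(embs[i], embs[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0

def class_centroids(embs, labels):
    """Mean embedding per label."""
    sums, counts = {}, {}
    for e, y in zip(embs, labels):
        prev = sums.get(y, [0.0] * len(e))
        sums[y] = [s + x for s, x in zip(prev, e)]
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def class_level_loss(embs, labels, margins, default_margin=0.5):
    """Assumed class-level term: pull samples toward their class centroid and
    apply a hinge penalty when two centroids are more similar than 1 - margin.
    `margins` maps a sorted label pair, e.g. ("es", "pt"), to its margin; a
    larger margin for closely related languages is a hypothetical choice."""
    cents = class_centroids(embs, labels)
    align = sum(1.0 - cosine(e, cents[y]) for e, y in zip(embs, labels)) / len(embs)
    classes = sorted(cents)
    sep, pairs = 0.0, 0
    for idx, a in enumerate(classes):
        for b in classes[idx + 1:]:
            m = margins.get((a, b), default_margin)
            sep += max(0.0, cosine(cents[a], cents[b]) - (1.0 - m))
            pairs += 1
    return align + (sep / pairs if pairs else 0.0)
```

In a sketch like this, the two terms would typically be summed with a weighting coefficient and added to a standard classification loss. How PolyLingua actually combines them, and how its margins adapt, is not stated in the abstract.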

Taxonomy

Topics: Authorship Attribution and Profiling · Speech Recognition and Synthesis · Natural Language Processing Techniques