PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection
Ali Lotfi Rezaabad, Bikram Khanal, Shashwat Chaurasia, Lu Zeng, Dezhi Hong, Hossein Bashashati, Thomas Butler, Megan Ganji

TL;DR
PolyLingua is a lightweight, contrastive learning-based Transformer model that significantly improves cross-domain language detection accuracy, especially for closely related languages, while being efficient enough for low-resource settings.
Contribution
It introduces a novel margin-based inter-class Transformer with a two-level contrastive learning framework for robust language detection across challenging datasets.
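To make the two-level idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation; the summary does not give the actual loss. The sketch assumes cosine similarity, a hinge penalty on cross-class pairs for the instance level, a pull toward class centroids for the class level, and a hypothetical per-language-pair margin table (`margins`) standing in for "adaptive margins". All function and parameter names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_level_margin_loss(embeddings, labels, margins=None, default_margin=0.5):
    """Illustrative two-level loss (an assumption, not the paper's formula).

    Instance level: for every pair with different labels, penalize cosine
    similarity above (1 - margin), where the margin is looked up per label pair.
    Class level: penalize each embedding's distance (1 - cosine) from the
    centroid of its own class.
    """
    margins = margins or {}

    # Instance-level separation over cross-class pairs.
    inst_terms = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if labels[i] != labels[j]:
                pair = frozenset((labels[i], labels[j]))
                m = margins.get(pair, default_margin)
                sim = cosine(embeddings[i], embeddings[j])
                inst_terms.append(max(0.0, sim - (1.0 - m)))
    instance_loss = sum(inst_terms) / len(inst_terms) if inst_terms else 0.0

    # Class-level alignment toward each class centroid.
    by_class = {}
    for e, y in zip(embeddings, labels):
        by_class.setdefault(y, []).append(e)
    centroids = {y: centroid(vs) for y, vs in by_class.items()}
    align_terms = [1.0 - cosine(e, centroids[y]) for e, y in zip(embeddings, labels)]
    class_loss = sum(align_terms) / len(align_terms) if align_terms else 0.0

    return instance_loss + class_loss


# Hypothetical usage: a larger margin for a closely related pair (es/pt)
# demands more separation, so overlapping embeddings are penalized more.
overlapping = [[1.0, 0.0], [1.0, 0.0]]
loose = two_level_margin_loss(overlapping, ["es", "pt"])
strict = two_level_margin_loss(
    overlapping, ["es", "pt"], margins={frozenset(("es", "pt")): 0.8}
)
```

In this sketch the margin table is fixed by hand; a learned or data-driven margin would replace the lookup without changing the rest of the structure.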
Findings
Achieves over 99% F1 on multilingual datasets
Outperforms existing models like Sonnet 3.5 in accuracy
Uses 10x fewer parameters for efficiency
Abstract
Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet existing language identification tools struggle with key cases -- such as music requests where the song title and user language differ. Open-source tools such as LangDetect and FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and…
Taxonomy
Topics: Authorship Attribution and Profiling · Speech Recognition and Synthesis · Natural Language Processing Techniques
