Discriminating Similar Languages: Evaluations and Explorations
Cyril Goutte, Serge Léger, Shervin Malmasi, Marcos Zampieri

TL;DR
This paper analyzes machine learning classifiers' ability to distinguish similar languages, evaluating progress, upper bounds, and challenging cases through experiments and human annotation.
Contribution
It provides a comprehensive analysis of classifier performance on similar languages, including progress over time and insights into difficult cases.
Findings
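Measures the progress made between the two editions of the DSL shared task
Estimates an upper bound on possible performance using ensemble and oracle combination
Provides learning curves that indicate which languages are more challenging
Identifies difficult sentences and investigates them further with human annotation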
Abstract
We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation.
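To make the distinction between ensemble and oracle combination concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes the ensemble is a simple plurality vote over system outputs and that the oracle counts a sentence as correct whenever at least one system predicted the gold label. The function names, the three systems, and the toy labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def plurality_vote(predictions):
    """Return the most frequent label among one sentence's predictions.

    Ties go to the label that appears first (an assumption of this sketch).
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_accuracy(system_outputs, gold):
    """Accuracy of a plurality-vote ensemble over several systems."""
    per_sentence = zip(*system_outputs)  # regroup predictions by sentence
    correct = sum(plurality_vote(preds) == g for preds, g in zip(per_sentence, gold))
    return correct / len(gold)


def oracle_accuracy(system_outputs, gold):
    """Accuracy if an oracle could always pick a correct system when one exists.

    This is an upper bound on what any combination of these outputs can reach.
    """
    per_sentence = zip(*system_outputs)
    correct = sum(g in preds for preds, g in zip(per_sentence, gold))
    return correct / len(gold)


# Toy data (hypothetical): gold labels and three systems' predictions.
gold = ["hr", "sr", "bs", "hr"]
systems = [
    ["hr", "sr", "hr", "sr"],  # system A
    ["hr", "bs", "hr", "sr"],  # system B
    ["sr", "sr", "bs", "sr"],  # system C
]

print(ensemble_accuracy(systems, gold))  # 0.5
print(oracle_accuracy(systems, gold))    # 0.75
```

In this toy case the third sentence is recovered by the oracle but lost by the vote, and the fourth is missed by every system. Sentences of the latter kind are the ones no combination of the submitted outputs can get right.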
Taxonomy
Topics: Authorship Attribution and Profiling · Multilingual Education and Policy
