Discriminating Similar Languages: Evaluations and Explorations

Cyril Goutte; Serge Léger; Shervin Malmasi; Marcos Zampieri

arXiv:1610.00031 · cs.CL · October 4, 2016 · 35 cites

TL;DR

This paper analyzes machine learning classifiers' ability to distinguish similar languages, evaluating progress, upper bounds, and challenging cases through experiments and human annotation.

Contribution

The paper analyzes classifier performance on similar languages and language varieties using the results of the two DSL shared task editions, covering progress between the editions, upper-bound estimates, and difficult cases.

Findings

01 Progress made between the two shared tasks

02 Upper bound estimates using ensemble methods

03 Identification of challenging sentences for classifiers

Abstract

We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation.
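
The ensemble and oracle combinations mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: a plurality-vote ensemble picks the label most systems agree on, and an oracle counts a sentence as correct whenever at least one system predicts the gold label, which gives an upper bound on what any combination of those systems could reach. The function names, the tie-breaking rule, and the toy data are illustrative assumptions, not the paper's actual setup or results.

```python
from collections import Counter

def accuracy(pred, gold):
    """Fraction of items where the predicted label equals the gold label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def plurality_vote(predictions):
    """Combine per-system label lists by plurality vote.

    Ties go to the label seen first, i.e. the earliest system in the list
    (an assumption of this sketch, not a rule taken from the paper).
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

def oracle_accuracy(predictions, gold):
    """Fraction of items that at least one system labels correctly."""
    hits = sum(g in labels for labels, g in zip(zip(*predictions), gold))
    return hits / len(gold)

# Toy data (hypothetical): three systems labelling four sentences.
gold = ["bs", "hr", "sr", "hr"]
systems = [
    ["bs", "hr", "hr", "sr"],
    ["bs", "sr", "sr", "sr"],
    ["hr", "hr", "sr", "hr"],
]

for i, preds in enumerate(systems):
    print(f"system {i}: {accuracy(preds, gold):.2f}")
print(f"ensemble: {accuracy(plurality_vote(systems), gold):.2f}")
print(f"oracle:   {oracle_accuracy(systems, gold):.2f}")
```

On this toy data the vote gets the last sentence wrong because two of three systems agree on an incorrect label, while the oracle still counts it as correct. The gap between the two numbers is what makes the oracle an upper bound rather than an achievable score.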

Taxonomy

Topics: Authorship Attribution and Profiling · Multilingual Education and Policy