The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification

Mihaela Găman; Radu Tudor Ionescu

arXiv:2007.15700·cs.CL·November 16, 2021

TL;DR

This paper investigates why machine learning models excel at Moldavian versus Romanian dialect identification, explores their robustness at sentence and tweet levels, and proposes an improved ensemble approach.

Contribution

The paper analyzes which discriminative features the ML models rely on, compares human and machine accuracy, and introduces an ensemble method that improves dialect classification performance.

Findings

1. ML models outperform humans in dialect identification accuracy.

2. Models maintain high accuracy even on short texts and tweets.

3. Ensemble stacking improves overall classification performance.

Abstract

Motivated by the seemingly high accuracy levels of machine learning models in Moldavian versus Romanian dialect identification and the increasing research interest on this topic, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 Evaluation Campaign. The shared task included two sub-task types: one that consisted in discriminating between the Moldavian and Romanian dialects and one that consisted in classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, e.g. the top model for Moldavian versus Romanian dialect identification obtained a macro F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates compared to machine learning (ML) models. Hence, it remains unclear why the methods proposed…
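
Since the abstract reports results as a macro F1 score, a minimal sketch of how that metric is computed for a two-class dialect task may help. The label names `"MD"` and `"RO"`, the function name `macro_f1`, and the toy predictions are assumptions made for illustration only; they are not taken from the paper or the shared task data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels invented for illustration (hypothetical, not from the paper).
y_true = ["MD", "MD", "MD", "RO", "RO", "RO"]
y_pred = ["MD", "MD", "RO", "RO", "RO", "RO"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because each class contributes equally to the average, macro F1 does not let a majority class mask weak performance on the other dialect.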

Code & Models

Repositories

raduionescu/MOROCO-Tweets
Official
