TL;DR
This paper investigates why machine learning models excel at Moldavian versus Romanian dialect identification, explores their robustness at sentence and tweet levels, and proposes an improved ensemble approach.
Contribution
It provides insights into the discriminative features used by ML models, compares human and machine accuracy, and introduces an ensemble method to enhance dialect classification performance.
Findings
ML models outperform humans in dialect identification accuracy.
Models maintain high accuracy even on short texts and tweets.
Ensemble stacking improves overall classification performance.
Abstract
Motivated by the seemingly high accuracy levels of machine learning models in Moldavian versus Romanian dialect identification and the increasing research interest in this topic, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 Evaluation Campaign. The shared task included two sub-task types: one that consisted of discriminating between the Moldavian and Romanian dialects and one that consisted of classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, e.g., the top model for Moldavian versus Romanian dialect identification obtained a macro F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates compared to machine learning (ML) models. Hence, it remains unclear why the methods proposed…
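For readers unfamiliar with the macro F1 metric quoted in the abstract, the following is a minimal sketch of how it is typically computed: the F1 score is calculated separately for each class and then averaged without weighting by class frequency. The function name `macro_f1`, the labels `"MD"` and `"RO"`, and the toy predictions are illustrative assumptions, not data or code from the paper or the shared task.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_label, pred_label in zip(y_true, y_pred):
        if gold_label == pred_label:
            tp[gold_label] += 1
        else:
            fp[pred_label] += 1  # predicted class gains a false positive
            fn[gold_label] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels: one Moldavian sample misread as Romanian.
gold = ["MD", "MD", "MD", "RO", "RO", "RO"]
pred = ["MD", "MD", "RO", "RO", "RO", "RO"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.3f}")
```

Because each class contributes equally to the average, a model cannot reach a high macro F1 by favoring only the more frequent dialect.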
