TL;DR
This paper evaluates Wiktionary's reliability in documenting morphological gaps in Latin and Italian, revealing high accuracy for Italian but notable discrepancies for Latin, and introduces scalable tools for assessing crowd-sourced linguistic data.
Contribution
The paper customizes a neural morphological analyzer to annotate Latin and Italian corpora, then uses the annotated data to computationally validate Wiktionary's lists of defective verbs in both languages.
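As a rough illustration of what corpus-based validation of a defectivity list could look like, the sketch below counts how often a supposedly missing inflectional slot is attested in analyzer output. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `find_contradicted`, the `(lemma, slot)` token format, the feature-string labels, the `min_count` threshold, and the toy data are all hypothetical.

```python
from collections import Counter

def find_contradicted(defective, annotated_tokens, min_count=2):
    """Return {lemma: {slot: count}} for slots listed as missing but attested.

    defective        : dict mapping a lemma to the set of slots the
                       crowd-sourced list claims do not exist (assumed format)
    annotated_tokens : iterable of (lemma, slot) pairs, standing in for
                       the output of a morphological analyzer (assumed format)
    min_count        : attestations required before a claimed gap counts as
                       contradicted (hypothetical threshold)
    """
    counts = Counter(annotated_tokens)  # (lemma, slot) -> corpus frequency
    contradicted = {}
    for lemma, missing_slots in defective.items():
        attested = {
            slot: counts[(lemma, slot)]
            for slot in missing_slots
            if counts[(lemma, slot)] >= min_count
        }
        if attested:
            contradicted[lemma] = attested
    return contradicted

# Toy data with placeholder lemmas; these are not real results.
defective = {
    "verb_a": {"IND;PRS;1;SG"},
    "verb_b": {"IND;PRS;1;SG", "SBJV;PRS;3;PL"},
}
tokens = [
    ("verb_a", "IND;PRS;3;SG"),
    ("verb_b", "IND;PRS;1;SG"),
    ("verb_b", "IND;PRS;1;SG"),
    ("verb_b", "SBJV;PRS;3;PL"),
]

result = find_contradicted(defective, tokens)
share = len(result) / len(defective)  # fraction of listed lemmas contradicted
```

The threshold is there because a single attestation may be an analyzer error or a scribal oddity; how strong the evidence must be before a lemma is treated as non-defective is a design choice that this sketch leaves open.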
Findings
Wiktionary reliably documents Italian morphological gaps.
About 7% of the Latin lemmas that Wiktionary lists as defective are contradicted by corpus evidence.
Tools for quality assurance of crowd-sourced linguistic data are proposed.
Abstract
Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps, because documenting and verifying them requires significant human expertise and effort. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary are often among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the resulting large annotated dataset, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while…