Mind the Gap: Assessing Wiktionary's Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages

Jonathan Sakunkoo; Annabella Sakunkoo

arXiv:2506.17603 · cs.CL · February 3, 2026

TL;DR

This paper evaluates Wiktionary's reliability in documenting morphological gaps in Latin and Italian, revealing high accuracy for Italian but notable discrepancies for Latin, and introduces scalable tools for assessing crowd-sourced linguistic data.

Contribution

The paper customizes a neural morphological analyzer and uses it in a computational validation method to assess how well Wiktionary covers morphological defectivity in Latin and Italian.

Findings

01 Wiktionary reliably documents Italian morphological gaps.

02 For 7% of the Latin verbs that Wiktionary lists as defective, corpus evidence does not support the defectivity claim.

03 Tools for quality assurance of crowd-sourced linguistic data are proposed.

Abstract

Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps, as such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary are often among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the massive annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while…
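The validation step described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the data layout, the feature-bundle strings, the placeholder lemmas `verb_a` and `verb_b`, the function name `flag_contradicted`, and the `min_count` threshold are all assumptions made for illustration. The idea is simply to count how often each (lemma, feature bundle) pair appears in an annotated corpus and to flag lemmas whose supposedly missing forms are in fact attested.

```python
from collections import Counter

# Toy annotated corpus (hypothetical): one (lemma, feature bundle) pair per token,
# as a morphological analyzer might produce. Lemmas are placeholders, not real verbs.
ANNOTATED_TOKENS = [
    ("verb_a", "IND;PRS;3;SG"),
    ("verb_a", "IND;FUT;1;SG"),
    ("verb_a", "IND;FUT;1;SG"),
    ("verb_b", "IND;PRS;3;SG"),
    ("verb_b", "IND;PRS;1;SG"),
]

# Toy crowd-sourced claims (hypothetical): lemma -> feature bundles said to be missing.
CLAIMED_GAPS = {
    "verb_a": {"IND;FUT;1;SG"},
    "verb_b": {"IND;PRS;1;SG", "IND;FUT;1;SG"},
}


def flag_contradicted(claimed_gaps, tokens, min_count=2):
    """Return lemmas whose claimed-missing forms occur at least `min_count` times.

    `min_count` is an assumed noise threshold: a single attestation could be
    an analyzer error, so it is not treated as evidence against a gap.
    """
    counts = Counter(tokens)  # keyed by (lemma, feature bundle)
    flagged = {}
    for lemma, gaps in claimed_gaps.items():
        hits = {
            feats: counts[(lemma, feats)]
            for feats in gaps
            if counts[(lemma, feats)] >= min_count
        }
        if hits:
            flagged[lemma] = hits
    return flagged


if __name__ == "__main__":
    print(flag_contradicted(CLAIMED_GAPS, ANNOTATED_TOKENS))
```

In this toy data, `verb_a` is flagged because its claimed gap is attested twice, while `verb_b` is not, since one of its claimed gaps appears only once and the other never appears. A real pipeline would also need to account for analyzer errors and corpus size, which this sketch reduces to a single threshold.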
