L-ReLF: A Framework for Lexical Dataset Creation
Anass Sedrati, Mounir Afifi, Reda Benkhadra

TL;DR
The paper presents L-ReLF, a reproducible framework for creating high-quality lexical datasets for low-resource languages, addressing challenges like source identification and OCR bias, and producing data compatible with Wikidata Lexemes.
Contribution
It introduces a systematic, generalizable methodology for developing structured lexical datasets for underserved languages, facilitating NLP applications.
Findings
Developed a technical pipeline for low-resource lexical data creation.
Successfully created a structured dataset compatible with Wikidata Lexemes.
Demonstrated the framework on Moroccan Darija as a case study.
Abstract
This paper introduces L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity on platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to coin new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, the use of Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital…
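To make the post-processing and data-model steps more concrete, here is a minimal sketch of what cleaning an OCR token and shaping it into a Lexeme-style record might look like. It is not the authors' pipeline: the function names, the specific cleanup rules (removing tatweel and short-vowel marks), and the placeholder item IDs are all assumptions chosen for illustration. The record layout loosely follows the Wikibase Lexeme JSON shape (lemmas, language, lexical category).

```python
import re
import unicodedata

# Assumed cleanup rules, not taken from the paper.
TATWEEL = "\u0640"                      # Arabic elongation character
HARAKAT = re.compile("[\u064B-\u0652]")  # fathatan through sukun


def normalize_ocr_token(text: str) -> str:
    """Apply generic Arabic-script cleanup to a raw OCR string."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")
    text = HARAKAT.sub("", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def to_lexeme_record(lemma: str, lang_code: str,
                     language_qid: str, category_qid: str) -> dict:
    """Build a dict loosely shaped like a Wikibase Lexeme (hypothetical helper)."""
    clean = normalize_ocr_token(lemma)
    return {
        "type": "lexeme",
        "lemmas": {lang_code: {"language": lang_code, "value": clean}},
        "language": language_qid,
        "lexicalCategory": category_qid,
    }


# "Q_LANGUAGE" and "Q_CATEGORY" are placeholders, not real Wikidata items.
record = to_lexeme_record("\u0643\u0640\u062A\u0640\u0627\u0628",
                          "ary", "Q_LANGUAGE", "Q_CATEGORY")
```

A real pipeline would likely need language-specific rules beyond this, since the abstract notes that OCR output is biased toward Modern Standard Arabic and requires rigorous correction.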
