L-ReLF: A Framework for Lexical Dataset Creation

Anass Sedrati; Mounir Afifi; Reda Benkhadra

arXiv:2603.29346·cs.CL·April 1, 2026

TL;DR

The paper presents L-ReLF, a reproducible framework for creating high-quality lexical datasets for low-resource languages, addressing challenges like source identification and OCR bias, and producing data compatible with Wikidata Lexemes.

Contribution

It introduces a systematic, generalizable methodology for developing structured lexical datasets for underserved languages, facilitating NLP applications.

Findings

01. Developed a technical pipeline for low-resource lexical data creation.

02. Successfully created a structured dataset compatible with Wikidata Lexemes.

03. Demonstrated applicability to languages like Moroccan Darija.

Abstract

This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital…
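The abstract names two steps that a short sketch can make concrete: post-processing OCR output and shaping each entry so it fits a lexeme-style data model. The Python below is only an illustration under stated assumptions, not the authors' pipeline. The cleanup rules (Unicode normalization, removing the tatweel character, dropping diacritics), the function names `normalize_lemma` and `build_lexeme`, the placeholder item IDs, and the sample word are all hypothetical. The record layout loosely mirrors the field names used by Wikidata Lexemes (`lemmas`, `language`, `lexicalCategory`, `senses`), and `ary` is the ISO 639-3 code for Moroccan Arabic.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, assumed here to be OCR noise


def normalize_lemma(raw: str) -> str:
    """Clean one OCR'd Arabic-script token (hypothetical rules, not the paper's)."""
    # NFKC folds Arabic presentation forms back to their base letters.
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace(TATWEEL, "")
    # Drop combining marks such as short-vowel diacritics.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse stray whitespace left by the OCR engine.
    return " ".join(text.split())


def build_lexeme(lemma: str, lang_code: str, language_item: str,
                 category_item: str, gloss: str, gloss_lang: str = "en") -> dict:
    """Build a lexeme-like record; the item IDs are supplied by the caller."""
    return {
        "type": "lexeme",
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        "language": language_item,
        "lexicalCategory": category_item,
        "senses": [
            {"glosses": {gloss_lang: {"language": gloss_lang, "value": gloss}}}
        ],
    }


if __name__ == "__main__":
    # Illustrative input only: a token with tatweel characters inserted.
    raw_token = "\u0643\u0640\u062a\u0640\u0627\u0628"
    record = build_lexeme(
        lemma=normalize_lemma(raw_token),
        lang_code="ary",
        language_item="Q_LANGUAGE_PLACEHOLDER",  # hypothetical, not a real item ID
        category_item="Q_CATEGORY_PLACEHOLDER",  # hypothetical, not a real item ID
        gloss="book",
    )
    print(record)
```

Keeping normalization separate from record construction means the cleanup rules can be swapped out per source or per language without touching the output format.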
