The SAMER Arabic Text Simplification Corpus
Bashar Alhafni, Reem Hazim, Juan Piñeros Liberato, Muhamed Al Khalil, Nizar Habash

TL;DR
The SAMER Corpus is a manually annotated Arabic parallel corpus for text simplification, with readability annotations and simplified versions of each text, built to support research on Arabic text simplification and educational technology.
Contribution
It introduces the first manually annotated Arabic parallel corpus for text simplification, with readability annotations at the document and word levels, supporting work on Arabic readability assessment and language learning tools.
Findings
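The corpus comprises 159K words selected from 15 publicly available Arabic fiction novels.
It contains readability level annotations at both the document and word levels.
It provides two simplified parallel versions of each text, targeting learners at two different readability levels.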
Abstract
We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels, most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques
