A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment

Khalid N. Elmadani; Nizar Habash; Hanada Taha-Thomure

arXiv:2502.13520·cs.CL·June 17, 2025

TL;DR

This paper presents BAREC, a comprehensive large-scale Arabic readability corpus with 69,441 sentences across 19 levels, and benchmarks various models to evaluate Arabic text complexity.

Contribution

Introduces BAREC, a large, balanced, manually annotated Arabic readability dataset covering diverse genres and levels, and provides benchmark results for automatic assessment.

Findings

01

High average pairwise inter-annotator agreement (Quadratic Weighted Kappa of 81.8%)

02

Competitive performance of various models on readability levels

03

Challenges identified in modeling Arabic readability

Abstract

This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning more than 1 million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, reflecting substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic…
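The agreement figure in the abstract is an average of pairwise Quadratic Weighted Kappa (QWK) scores. The following is a minimal sketch of how such a score could be computed, not the authors' evaluation code. It uses the standard QWK definition with disagreement weights (i - j)^2 / (K - 1)^2 over K ordinal levels. The function names, the assumption that every annotator labels the same sentences, and the toy labels are all hypothetical.

```python
from itertools import combinations


def quadratic_weighted_kappa(a, b, num_levels):
    """QWK between two equal-length lists of integer levels in 1..num_levels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)

    # Observed confusion matrix between the two annotators.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for x, y in zip(a, b):
        observed[x - 1][y - 1] += 1

    # Marginal label counts, used for the chance-expected matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels))
              for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: larger level gaps cost more.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both annotators used one and the same level for every item.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def average_pairwise_qwk(annotations, num_levels):
    """Mean QWK over all annotator pairs (assumes a shared set of items)."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(quadratic_weighted_kappa(a, b, num_levels)
               for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy labels on a 19-level scale; invented for illustration only.
    annotators = [
        [1, 5, 12, 19, 7],
        [2, 5, 11, 19, 8],
        [1, 6, 12, 18, 7],
    ]
    print(f"average pairwise QWK: {average_pairwise_qwk(annotators, 19):.3f}")
```

Because the weights grow with the square of the level gap, a one-level disagreement on a 19-level scale is penalized far less than a ten-level one, which is why QWK suits ordinal labels such as readability levels.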

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Handwritten Text Recognition Techniques