A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment
Khalid N. Elmadani, Nizar Habash, Hanada Taha-Thomure

TL;DR
This paper presents BAREC, a comprehensive large-scale Arabic readability corpus with 69,441 sentences across 19 levels, and benchmarks various models to evaluate Arabic text complexity.
Contribution
Introduces BAREC, a large, balanced, manually annotated Arabic readability dataset covering diverse genres and levels, and provides benchmark results for automatic assessment.
Findings
High inter-annotator agreement: average pairwise Quadratic Weighted Kappa of 81.8%
Benchmarks comparing a range of techniques across different readability granularity levels
Challenges identified in modeling Arabic readability
Abstract
This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning more than 1 million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, indicating substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic…
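The agreement figure quoted in the abstract is an average of pairwise Quadratic Weighted Kappa (QWK) scores. The sketch below shows one common way to compute such a score; it is not the authors' code. It assumes each annotator's labels are integers from 1 to 19 (one per sentence, in the same order), and the function names `quadratic_weighted_kappa` and `average_pairwise_qwk` are hypothetical.

```python
from itertools import combinations


def quadratic_weighted_kappa(a, b, num_levels=19):
    """QWK between two equal-length lists of integer levels in 1..num_levels.

    Assumption: levels are encoded as integers 1..num_levels.
    """
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)

    # Observed confusion matrix: rows are annotator a, columns are annotator b.
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for x, y in zip(a, b):
        observed[x - 1][y - 1] += 1.0

    # Marginal label counts for each annotator.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels))
              for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: disagreements further apart cost more.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both annotators used one identical level everywhere.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def average_pairwise_qwk(annotations, num_levels=19):
    """Mean QWK over all annotator pairs.

    `annotations` maps an annotator name to that annotator's list of levels.
    """
    pairs = list(combinations(sorted(annotations), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    scores = [quadratic_weighted_kappa(annotations[p], annotations[q], num_levels)
              for p, q in pairs]
    return sum(scores) / len(scores)
```

Because the penalty grows with the squared distance between levels, two annotators who differ by one level on a 19-level scale are penalized far less than two who differ by ten, which suits an ordinal scale like this one. Identical label lists give a score of 1, and fully reversed ones give a negative score.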
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
