TL;DR
This paper develops a Bengali readability analysis tool using adapted formulas, neural models, and new annotated corpora, addressing resource scarcity and establishing baselines for future research.
Contribution
It introduces the first comprehensive Bengali readability datasets, adapts existing formulas, and experiments with neural architectures for sentence-level prediction.
Findings
Created a Bengali document dataset with 618 entries across 12 grade levels.
Developed a large-scale sentence dataset with over 96,000 sentences labeled as simple or complex.
Established baseline neural models for Bengali readability prediction.
Abstract
Determining the readability of a text is the first step to its simplification. In this paper, we present a readability analysis tool capable of analyzing text written in the Bengali language to provide in-depth information on its readability and complexity. Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing. Readability-related research on the Bengali language has so far been narrow and sometimes faulty due to the lack of resources. Therefore, we correctly adapt document-level readability formulas traditionally used for the U.S.-based education system to the Bengali language with a proper age-to-age comparison. Due to the unavailability of large-scale human-annotated corpora, we further divide the document-level task into sentence-level and experiment…
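To make the idea of a document-level readability formula concrete, here is a minimal sketch of one well-known U.S. grade-level formula, the Automated Readability Index (ARI). The abstract excerpt does not say which formulas the paper adapts or how, so the choice of ARI, the helper names `text_counts` and `automated_readability_index`, the use of the Bengali danda (।) as a sentence delimiter, and the decision to count Unicode letters, combining marks, and digits as characters are all assumptions made for illustration, not the authors' method.

```python
import re
import unicodedata


def text_counts(text):
    """Return (characters, words, sentences) for a text.

    Assumption: sentences end at the Bengali danda or at ., !, ?;
    characters are Unicode letters, combining marks, and digits.
    """
    sentences = [s for s in re.split(r"[।.!?]+", text) if s.strip()]
    words = [w for w in re.split(r"[\s।.!?,;:]+", text) if w]
    chars = sum(
        1
        for w in words
        for ch in w
        if unicodedata.category(ch)[0] in "LMN"
    )
    return chars, len(words), len(sentences)


def automated_readability_index(text):
    """Standard ARI: 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    chars, words, sentences = text_counts(text)
    if words == 0 or sentences == 0:
        raise ValueError("text must contain at least one word and one sentence")
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43


if __name__ == "__main__":
    sample = "আমি ভাত খাই। সে বই কেনে।"
    print(text_counts(sample))
    print(automated_readability_index(sample))
```

The raw score is calibrated to U.S. school grades and English orthography, which is why a direct application to Bengali text would need the kind of age-to-age mapping the abstract describes; this sketch does not attempt that mapping.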
