Lexical and Statistical Analysis of Bangla Newspaper and Literature: A Corpus-Driven Study on Diversity, Readability, and NLP Adaptation
Pramit Bhattacharyya, Arnab Bhattacharya

TL;DR
This study conducts a detailed corpus-driven analysis of Bangla literary and newspaper texts, revealing differences in lexical diversity, structural complexity, and readability, and exploring their implications for NLP model performance.
Contribution
It introduces extensive Bangla corpora and compares their linguistic properties, demonstrating how literary data enhances NLP tasks and adheres more closely to statistical laws of language such as Zipf's law.
Findings
Literary corpora have higher lexical richness and structural variation.
Models perform better when trained on combined literary and newspaper data.
Literary texts are more complex and adhere more closely to Zipf's law.
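One common way to quantify adherence to Zipf's law is to regress log frequency on log rank and check how close the slope is to -1 and how well the line fits. The sketch below illustrates that idea only; this summary does not state the authors' exact procedure, and the function name `zipf_fit` and the use of an ordinary least-squares fit are assumptions made for illustration.

```python
import math
from collections import Counter

def zipf_fit(tokens):
    """Fit log(frequency) = intercept + slope * log(rank) by least squares.

    Returns (slope, r_squared). A slope near -1 with a high r_squared
    suggests closer adherence to Zipf's law. Hypothetical helper, not
    the paper's method.
    """
    # Word frequencies sorted from most to least frequent (rank 1, 2, ...).
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct word types")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    # If every type has the same frequency, the flat line fits exactly.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared
```

Running `zipf_fit` on the token lists of two corpora and comparing the slopes and fit quality gives a rough, like-for-like indication of which corpus is closer to the ideal Zipfian curve.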
Abstract
In this paper, we present a comprehensive corpus-driven analysis of Bangla literary and newspaper texts to investigate their lexical diversity, structural complexity and readability. We use Vacaspati and IndicCorp, the most extensive literature-only and newspaper-only corpora for Bangla, respectively. We examine key linguistic properties, including the type-token ratio (TTR), hapax legomena ratio (HLR), bigram diversity, average syllable and word lengths, and adherence to Zipf's law, for both the newspaper corpus (IndicCorp) and the literary corpus (Vacaspati). Across all features, including bigram diversity and HLR, the literary corpus exhibits significantly higher lexical richness and structural variation despite its smaller size. We also assess the diversity of the corpora by building n-gram models and measuring perplexity. Our findings reveal that literary corpora have higher…
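The lexical measures named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens come from plain whitespace splitting, HLR is taken as hapax legomena divided by total tokens (some studies divide by the number of types instead), bigram diversity is taken as unique bigrams divided by total bigrams, and the name `lexical_stats` is hypothetical.

```python
from collections import Counter

def lexical_stats(tokens):
    """Return TTR, HLR, and bigram diversity for a list of tokens.

    Assumed definitions (not confirmed by the paper):
      TTR              = unique word types / total tokens
      HLR              = words occurring exactly once / total tokens
      bigram diversity = unique bigrams / total bigrams
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")

    counts = Counter(tokens)
    hapax = sum(1 for c in counts.values() if c == 1)

    # Adjacent token pairs; there are n - 1 of them.
    bigrams = list(zip(tokens, tokens[1:]))

    return {
        "ttr": len(counts) / n,
        "hlr": hapax / n,
        "bigram_diversity": len(set(bigrams)) / len(bigrams),
    }

# Whitespace tokenization is a simplifying assumption for this example.
sample = "আমি ভাত খাই আমি বই পড়ি".split()
stats = lexical_stats(sample)
```

Because TTR and HLR both fall as a corpus grows, comparing corpora of different sizes is more meaningful when the statistics are computed on equal-sized samples.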
Taxonomy
Topics: Text Readability and Simplification · Authorship Attribution and Profiling · Natural Language Processing Techniques
