GHTM: A Graph-based Hybrid Topic Modeling Approach with a Benchmark Dataset for the Low-Resource Bengali Language
Farhana Haque, Md. Abdur Rahman, Sumon Ahmed

TL;DR
This paper introduces GHTM, a graph-based hybrid topic model for Bengali, together with NCTBText, a new and diverse Bengali dataset. GHTM is reported to outperform existing methods on topic coherence and diversity and to generalize to English datasets.
Contribution
The study presents GHTM, a hybrid architecture that combines word embeddings, a Graph Convolutional Network (GCN), and Non-negative Matrix Factorization (NMF), and introduces NCTBText, a diverse Bengali dataset for topic modeling research.
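The abstract describes the embedding stage as TF-IDF-weighted GloVe embeddings. As a rough illustration of that idea only, the sketch below builds one vector per document as a TF-IDF-weighted average of word vectors. It is not the authors' implementation: the function name `tfidf_weighted_doc_vectors`, the smoothed IDF formula, and the toy two-dimensional vectors standing in for GloVe are all assumptions made for this example, and the GCN and NMF stages are not shown.

```python
import math
from collections import Counter


def tfidf_weighted_doc_vectors(docs, word_vectors):
    """Return one vector per document as a TF-IDF-weighted average of word vectors.

    docs: list of token lists.
    word_vectors: dict mapping token -> list of floats (hypothetical stand-in for GloVe).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency (an assumed variant, not taken from the paper).
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in doc_freq.items()}

    dim = len(next(iter(word_vectors.values())))
    doc_vectors = []
    for tokens in docs:
        # Term counts, restricted to tokens that have an embedding.
        counts = Counter(t for t in tokens if t in word_vectors)
        total = [0.0] * dim
        weight_sum = 0.0
        for word, count in counts.items():
            weight = count * idf[word]
            weight_sum += weight
            for i, value in enumerate(word_vectors[word]):
                total[i] += weight * value
        if weight_sum > 0:
            total = [x / weight_sum for x in total]
        # Documents with no known tokens keep the zero vector.
        doc_vectors.append(total)
    return doc_vectors


# Toy usage with placeholder tokens and vectors.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_docs = [["a", "b"], ["a", "c"], []]
toy_result = tfidf_weighted_doc_vectors(toy_docs, toy_vectors)
```

In the first toy document, the rarer token "b" receives a larger weight than "a", which appears in two documents, so the resulting vector leans toward the embedding of "b".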
Findings
GHTM outperforms existing methods in topic coherence and diversity.
GHTM demonstrates strong cross-lingual generalization on English datasets.
NCTBText provides a diverse benchmark dataset for future Bengali topic modeling research.
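This summary does not state which diversity metric the paper uses. One common definition, shown here purely as an assumption, is the fraction of unique words among the top-k words of all topics: a score of 1.0 means no word is shared between topics. The function name `topic_diversity` and the placeholder topic words are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    topics: list of word lists, each ordered from most to least important.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy usage: "b" appears in both topics, so 3 of the 4 top words are unique.
toy_topics = [["a", "b"], ["b", "c"]]
toy_score = topic_diversity(toy_topics, top_k=2)
```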
Abstract
Topic modeling is a Natural Language Processing (NLP) technique used to discover latent themes and abstract topics from text corpora by grouping co-occurring keywords. Although widely researched in English, topic modeling remains understudied in Bengali due to a lack of adequate resources and initiatives. Existing Bengali topic modeling research lacks standardized evaluation frameworks with comprehensive baselines and diverse datasets, exploration of modern methodological approaches, and reproducible implementations, with only three Bengali-specific architectures proposed to date. To address these gaps, this study presents a comprehensive evaluation of traditional and contemporary topic modeling approaches across three Bengali datasets and introduces GHTM (Graph-based Hybrid Topic Model), a novel architecture that strategically integrates TF-IDF-weighted GloVe embeddings, Graph…
