GHTM: A Graph-based Hybrid Topic Modeling Approach with a Benchmark Dataset for the Low-Resource Bengali Language

Farhana Haque; Md. Abdur Rahman; Sumon Ahmed

arXiv:2508.00605 · cs.CL · March 31, 2026

TL;DR

This paper introduces GHTM, a graph-based hybrid topic model for Bengali, together with NCTBText, a new and diverse Bengali dataset. The authors report that GHTM achieves higher topic coherence and diversity than existing methods and that it also generalizes to English datasets.

Contribution

The study presents GHTM, a hybrid architecture that combines TF-IDF-weighted GloVe embeddings, a Graph Convolutional Network (GCN), and Non-negative Matrix Factorization (NMF). It also introduces NCTBText, a diverse Bengali dataset for topic modeling research.

Findings

01. GHTM outperforms the traditional and contemporary topic models it is evaluated against in topic coherence and diversity.

02. GHTM generalizes across languages, also performing strongly on English datasets.

03. NCTBText provides a diverse Bengali benchmark dataset for future research.
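
To make the "diversity" finding concrete, here is a minimal sketch of one widely used topic diversity score: the fraction of unique words among the top-k words of all topics. This page does not say which diversity metric the paper uses, so the definition, the function name `topic_diversity`, and the toy topics below are all assumptions for illustration, not the authors' evaluation code.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    A value near 1.0 means topics share few words; a value near 0.0
    means the topics are largely redundant.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics (English placeholders, not output from GHTM).
topics = [
    ["river", "water", "flood", "rain"],
    ["school", "student", "teacher", "exam"],
    ["water", "health", "doctor", "hospital"],
]
score = topic_diversity(topics, top_k=4)
print(score)
```

In this toy example only "water" appears in more than one topic, so the score is slightly below 1.0.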

Abstract

Topic modeling is a Natural Language Processing (NLP) technique used to discover latent themes and abstract topics from text corpora by grouping co-occurring keywords. Although widely researched in English, topic modeling remains understudied in Bengali due to a lack of adequate resources and initiatives. Existing Bengali topic modeling research lacks standardized evaluation frameworks with comprehensive baselines and diverse datasets, exploration of modern methodological approaches, and reproducible implementations, with only three Bengali-specific architectures proposed to date. To address these gaps, this study presents a comprehensive evaluation of traditional and contemporary topic modeling approaches across three Bengali datasets and introduces GHTM (Graph-based Hybrid Topic Model), a novel architecture that strategically integrates TF-IDF-weighted GloVe embeddings, Graph…
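
As a rough illustration of the first ingredient named in the abstract, the sketch below builds document vectors as TF-IDF-weighted averages of word vectors. It is a sketch under stated assumptions, not the paper's implementation: the two-dimensional toy vectors stand in for real GloVe embeddings, the smoothed IDF formula is one common variant, and the function name `tfidf_weighted_embeddings` and the normalization by total weight are choices made here for illustration. The graph and factorization stages are not shown.

```python
import math
from collections import Counter


def tfidf_weighted_embeddings(docs, vectors):
    """Return one vector per document: the TF-IDF-weighted mean of its word vectors."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    dim = len(next(iter(vectors.values())))
    embeddings = []
    for doc in docs:
        # Words without a vector are skipped (assumption: no OOV handling).
        counts = Counter(word for word in doc if word in vectors)
        total_weight = 0.0
        doc_vec = [0.0] * dim
        for word, count in counts.items():
            tf = count / len(doc)
            # Smoothed IDF (assumed variant): rarer words get larger weights.
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
            weight = tf * idf
            total_weight += weight
            for i, value in enumerate(vectors[word]):
                doc_vec[i] += weight * value
        if total_weight > 0:
            doc_vec = [value / total_weight for value in doc_vec]
        embeddings.append(doc_vec)
    return embeddings


# Toy stand-ins for GloVe vectors (hypothetical values).
vectors = {"river": [1.0, 0.0], "water": [0.5, 0.5], "school": [0.0, 1.0]}
docs = [["river", "water", "water"], ["school", "water"], ["school", "school"]]
doc_embeddings = tfidf_weighted_embeddings(docs, vectors)
for doc, emb in zip(docs, doc_embeddings):
    print(doc, [round(x, 3) for x in emb])
```

Weighting by TF-IDF lets distinctive words such as "river" pull a document vector further than words like "water" that occur in several documents, which a plain average would not do.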
