Combining Language and Topic Models for Hierarchical Text Classification

Jaco du Toit; Marcel Dunaiski

arXiv:2507.16490·cs.CL·July 23, 2025

Open Access

TL;DR

This paper investigates combining pre-trained language models with topic models for hierarchical text classification, finding that adding topic model features generally does not improve performance over using PLMs alone.

Contribution

The paper introduces an HTC approach that combines features from PLMs and topic models, and evaluates the approach on multiple benchmark datasets.

Findings

1. Topic model features often decrease classification performance.
2. PLMs alone outperform combined features in HTC tasks.
3. Incorporating topic models may not always benefit text classification.

Abstract

Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use a HTC…
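
The feature combination described in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible fusion, concatenating a dense document embedding with a topic distribution and scoring each class with an independent sigmoid; this is not the architecture from the paper. The vector sizes, the random stand-in values, and the helper names `concat_features` and `class_scores` are all hypothetical.

```python
import math
import random


def concat_features(plm_embedding, topic_distribution):
    """Join a dense document embedding with a topic distribution (assumed fusion)."""
    return list(plm_embedding) + list(topic_distribution)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def class_scores(features, weights, biases):
    """One independent sigmoid score per class, as in a multi-label head."""
    return [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(weights, biases)
    ]


rng = random.Random(0)

# Stand-ins: a real system would take these from a PLM and a topic model.
plm_embedding = [rng.uniform(-1, 1) for _ in range(8)]
topic_distribution = [0.7, 0.2, 0.1]

features = concat_features(plm_embedding, topic_distribution)

# Untrained, randomly initialised linear head over four hypothetical classes.
num_classes = 4
weights = [[rng.uniform(-0.5, 0.5) for _ in features] for _ in range(num_classes)]
biases = [0.0] * num_classes

scores = class_scores(features, weights, biases)
predicted = [i for i, s in enumerate(scores) if s >= 0.5]
```

Because each class gets its own score, a document can be assigned several classes at once, which is what a hierarchy with parent and child labels requires. The weights here are untrained, so `predicted` has no meaning beyond showing the data flow.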

Taxonomy

Topics: Text and Document Classification Technologies · Topic Modeling · Sentiment Analysis and Opinion Mining