Combining Language and Topic Models for Hierarchical Text Classification
Jaco du Toit, Marcel Dunaiski

TL;DR
This paper investigates combining pre-trained language models with topic models for hierarchical text classification, finding that adding topic model features generally does not improve performance over using PLMs alone.
Contribution
The paper introduces an HTC approach that combines features from PLMs and topic models, and evaluates the combined features against PLM-only features on multiple benchmark datasets.
Findings
Topic model features often decrease classification performance
PLMs alone outperform combined features in HTC tasks
Incorporating topic models may not always benefit text classification
Abstract
Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use a HTC…
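The feature combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture, which this summary does not specify. It assumes that each document already has a fixed-size PLM embedding and a topic-proportion vector, and that the two are simply concatenated before a linear multi-label classification head. The function names `combine_features` and `predict_labels`, the dimensions, and the random stand-in data are all hypothetical.

```python
import numpy as np


def combine_features(plm_embeddings, topic_distributions):
    """Concatenate per-document PLM embeddings with topic proportions.

    Assumption: plain concatenation; the paper may fuse features differently.
    """
    if plm_embeddings.shape[0] != topic_distributions.shape[0]:
        raise ValueError("both feature matrices must cover the same documents")
    return np.concatenate([plm_embeddings, topic_distributions], axis=1)


def predict_labels(features, weights, bias, threshold=0.5):
    """Multi-label prediction: one independent sigmoid score per class."""
    logits = features @ weights + bias
    probabilities = 1.0 / (1.0 + np.exp(-logits))
    return probabilities >= threshold


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_docs, plm_dim, num_topics, num_classes = 4, 8, 3, 5

# Random stand-ins for real extractor outputs:
# a PLM would supply the embeddings, a topic model the proportions.
plm_embeddings = rng.normal(size=(num_docs, plm_dim))
topic_distributions = rng.dirichlet(np.ones(num_topics), size=num_docs)

features = combine_features(plm_embeddings, topic_distributions)

# Untrained, randomly initialised linear head over the combined features.
weights = rng.normal(size=(plm_dim + num_topics, num_classes))
bias = np.zeros(num_classes)
predictions = predict_labels(features, weights, bias)

print(features.shape)     # (num_docs, plm_dim + num_topics)
print(predictions.shape)  # (num_docs, num_classes)
```

Dropping the `topic_distributions` columns from `features` gives the PLM-only baseline that the findings above compare against. In a hierarchical setting, the predicted class set would also need to be made consistent with the class hierarchy, which this sketch leaves out.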
Taxonomy
Topics: Text and Document Classification Technologies · Topic Modeling · Sentiment Analysis and Opinion Mining
