Topic Modeling with Fine-tuning LLMs and Bag of Sentences
Johannes Schneider

TL;DR
This paper introduces FT-Topic, an unsupervised approach for fine-tuning large language models using bags of sentences. The approach is used to derive SenClu, a topic modeling method reported as state-of-the-art that offers fast inference and lets users incorporate prior knowledge.
Contribution
The paper proposes a novel unsupervised fine-tuning method for LLMs based on sentence bags, improving topic modeling performance and efficiency.
Findings
FT-Topic effectively fine-tunes LLMs for topic modeling.
SenClu achieves state-of-the-art results with fast inference.
The approach allows incorporation of prior knowledge.
Abstract
Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable labeled dataset for fine-tuning. In this paper, we build on the recent idea of using bags of sentences as the elementary unit for computing topics. Based on this idea, we derive an approach called FT-Topic to perform unsupervised fine-tuning, relying primarily on two steps for constructing a training dataset in an automatic fashion. First, a heuristic method identifies pairs of sentence groups that are assumed to belong either to the same topic or to different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The resulting dataset is then used to fine-tune…
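The two dataset-construction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not specify the heuristic or the filtering rule, so the sketch assumes that sentence groups from the same document share a topic and groups from different documents do not, and it uses a bag-of-words cosine similarity as a stand-in for an encoder. The function names (`split_into_bags`, `build_pairs`, `filter_pairs`) and the thresholds `pos_min` and `neg_max` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def split_into_bags(sentences, bag_size=2):
    """Group consecutive sentences into bags of up to `bag_size` sentences."""
    return [" ".join(sentences[i:i + bag_size])
            for i in range(0, len(sentences), bag_size)]


def cosine(a, b):
    """Bag-of-words cosine similarity (a stand-in for encoder embeddings)."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def build_pairs(docs, bag_size=2):
    """Step 1 (assumed heuristic): label a pair 1 if both bags come from
    the same document, 0 if they come from different documents."""
    bags = [(doc_id, bag)
            for doc_id, sentences in enumerate(docs)
            for bag in split_into_bags(sentences, bag_size)]
    return [(b1, b2, 1 if d1 == d2 else 0)
            for (d1, b1), (d2, b2) in combinations(bags, 2)]


def filter_pairs(pairs, pos_min=0.1, neg_max=0.5):
    """Step 2 (assumed rule): drop positive pairs that look too dissimilar
    and negative pairs that look too similar, as likely label errors."""
    kept = []
    for a, b, label in pairs:
        sim = cosine(a, b)
        if (label == 1 and sim >= pos_min) or (label == 0 and sim <= neg_max):
            kept.append((a, b, label))
    return kept
```

The filtered `(bag_a, bag_b, label)` triples would then serve as training data for fine-tuning an encoder, for example with a contrastive objective; that choice of objective is likewise an assumption here, since the abstract is truncated before it describes the fine-tuning step.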
Taxonomy
Topics: Computational and Text Analysis Methods · Advanced Text Analysis Techniques
Methods: Attention Is All You Need · Softmax · Dense Connections · Dropout · Linear Layer · Attention Dropout · Residual Connection · Linear Warmup With Linear Decay · WordPiece
