Investigating the Impact of Text Summarization on Topic Modeling
Trishia Khandelwal

TL;DR
This paper explores how summarizing documents with a large language model before topic modeling can improve the quality and diversity of the extracted themes, especially for long documents.
Contribution
It introduces a novel approach that leverages pre-trained LLMs for document summarization to enhance neural topic modeling performance.
Findings
Summarization improves topic diversity.
Each dataset has an optimal summary length that improves topic modeling performance.
The method yields better topic diversity with comparable coherence.
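To make the diversity finding concrete, here is a minimal sketch of one common way to score topic diversity: the fraction of unique words among the top-k words of all topics. This definition is an assumption for illustration; the summary above does not state which diversity metric the paper uses, and the function name `topic_diversity` and the example topics are hypothetical.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    Assumed definition for illustration; 1.0 means no word is shared
    between topics, lower values mean more overlap.
    """
    # Collect the top_k words of each topic into one flat list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical topics: "team" appears in both, so 7 of the 8 words are unique.
topics = [
    ["game", "team", "season", "player"],
    ["market", "stock", "price", "team"],
]
diversity = topic_diversity(topics, top_k=4)
print(diversity)
```

A higher score means the topics overlap less, which is the sense in which summaries are reported to help.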
Abstract
Topic models are used to identify and group similar themes in a set of documents. Recent advancements in deep-learning-based neural topic models have received significant research interest. In this paper, an approach is proposed that further enhances topic modeling performance by utilizing a pre-trained large language model (LLM) to generate summaries of documents before inputting them into the topic model. Few-shot prompting is used to generate summaries of different lengths to compare their impact on topic modeling. This approach is particularly effective for larger documents because it helps capture the most essential information while reducing noise and irrelevant details that could obscure the overall theme. Additionally, it is observed that datasets exhibit an optimal summary length that leads to improved topic modeling performance. The proposed method yields better topic diversity…
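The workflow described in the abstract (summarize each document at several target lengths, fit a topic model on the summaries, and compare the results) can be sketched as a small length sweep. This is a sketch under stated assumptions, not the paper's implementation: `sweep_summary_lengths` and all three helpers are hypothetical names, the truncating summarizer is a stand-in for the few-shot-prompted LLM, and the toy topic model and score are placeholders for a neural topic model and a real diversity or coherence metric.

```python
from typing import Callable, Dict, List, Sequence

Summarizer = Callable[[str, int], str]                # (document, max_words) -> summary
TopicModel = Callable[[List[str]], List[List[str]]]   # documents -> topics as word lists
Scorer = Callable[[List[List[str]]], float]           # topics -> quality score


def sweep_summary_lengths(
    docs: Sequence[str],
    summarize: Summarizer,
    fit_topics: TopicModel,
    score: Scorer,
    lengths: Sequence[int],
) -> Dict[int, float]:
    """Score the topics obtained from summaries of each target length."""
    results: Dict[int, float] = {}
    for max_words in lengths:
        # Summarize first, then model topics on the summaries only.
        summaries = [summarize(doc, max_words) for doc in docs]
        results[max_words] = score(fit_topics(summaries))
    return results


def truncate_summarizer(doc: str, max_words: int) -> str:
    # Stand-in only: keeps the first max_words words. NOT an LLM summary.
    return " ".join(doc.split()[:max_words])


def one_topic_per_doc(docs: List[str]) -> List[List[str]]:
    # Toy stand-in for a topic model: each document's words form one "topic".
    return [doc.split() for doc in docs]


def unique_word_ratio(topics: List[List[str]]) -> float:
    # Toy score: share of unique words across all topic words.
    words = [word for topic in topics for word in topic]
    return len(set(words)) / len(words) if words else 0.0


docs = [
    "stocks fell as markets reacted to rate news",
    "the team won as fans reacted to the goal",
]
scores = sweep_summary_lengths(
    docs, truncate_summarizer, one_topic_per_doc, unique_word_ratio, [2, 8]
)
best_length = max(scores, key=scores.get)  # length with the highest score
print(scores, best_length)
```

Passing the summarizer, topic model, and scorer as callables keeps the sweep independent of any particular LLM or topic modeling library, so the stand-ins can be swapped for real components without changing the loop.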
Taxonomy
Topics: Computational and Text Analysis Methods · Advanced Text Analysis Techniques · Data Quality and Management
Methods: Sparse Evolutionary Training
