Investigating the Impact of Text Summarization on Topic Modeling

Trishia Khandelwal

arXiv:2410.09063·cs.CL·October 15, 2024

Open Access

TL;DR

This paper explores how using summaries generated by large language models before topic modeling can improve the quality and diversity of extracted themes, especially for large documents.

Contribution

It introduces a novel approach that leverages pre-trained LLMs for document summarization to enhance neural topic modeling performance.

Findings

01 Summarizing documents before topic modeling improves topic diversity.

02 Each dataset has an optimal summary length that gives the best topic modeling performance.

03 The method yields better topic diversity with comparable coherence.
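To make "topic diversity" concrete, here is a minimal sketch of one common definition: the fraction of unique words among the top-k words of all topics. This page does not say which diversity metric the paper uses, so the formula, the function name `topic_diversity`, and the toy word lists are illustrative assumptions, not the paper's implementation or results.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; values near 0 mean the
    topics are largely redundant. This is an assumed metric definition.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("need at least one topic word")
    return len(set(top_words)) / len(top_words)


# Made-up topic word lists, for illustration only (not data from the paper).
overlapping_topics = [
    ["game", "team", "season", "player"],
    ["team", "game", "win", "coach"],
]
distinct_topics = [
    ["game", "team", "season", "player"],
    ["market", "stock", "price", "bank"],
]

diversity_overlapping = topic_diversity(overlapping_topics, top_k=4)
diversity_distinct = topic_diversity(distinct_topics, top_k=4)
print(diversity_overlapping, diversity_distinct)
```

Under this definition, two topics that share the words "game" and "team" score lower than two topics with no shared words.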

Abstract

Topic models are used to identify and group similar themes in a set of documents. Recent advancements in deep-learning-based neural topic models have received significant research interest. In this paper, an approach is proposed that further enhances topic modeling performance by utilizing a pre-trained large language model (LLM) to generate summaries of documents before inputting them into the topic model. Few-shot prompting is used to generate summaries of different lengths to compare their impact on topic modeling. This approach is particularly effective for larger documents because it helps capture the most essential information while reducing noise and irrelevant details that could obscure the overall theme. Additionally, it is observed that datasets exhibit an optimal summary length that leads to improved topic modeling performance. The proposed method yields better topic diversity…
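The few-shot step described in the abstract could be sketched as below: a prompt is assembled from a handful of document/summary pairs, with a word budget that is varied to produce summaries of different lengths. The prompt wording, the helper name `build_few_shot_prompt`, the example pair, and the length targets (25, 50, 100 words) are all hypothetical. The abstract does not give the actual prompts or lengths, and the call to an LLM is deliberately left out.

```python
from typing import Dict, List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], document: str, max_words: int
) -> str:
    """Assemble a few-shot summarization prompt with a target length.

    `examples` holds (document, summary) demonstration pairs; the final
    block leaves the summary empty for the model to complete.
    """
    parts = [f"Summarize each document in at most {max_words} words."]
    for source, summary in examples:
        parts.append(f"Document: {source}\nSummary: {summary}")
    parts.append(f"Document: {document}\nSummary:")
    return "\n\n".join(parts)


# Hypothetical demonstration pair and input document.
examples = [
    (
        "The city council met on Tuesday and voted to fund two new bus routes.",
        "Council funds two new bus routes.",
    )
]
document = "Researchers released a new dataset of annotated news articles."

# One prompt per assumed length target; each would be sent to an LLM, and
# the resulting summaries passed to the topic model in place of the full text.
prompts: Dict[int, str] = {
    n: build_few_shot_prompt(examples, document, n) for n in (25, 50, 100)
}
print(prompts[50])
```

Keeping the length target as a parameter makes it straightforward to compare topic modeling results across summary lengths, which is the comparison the abstract describes.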

Taxonomy

Topics: Computational and Text Analysis Methods · Advanced Text Analysis Techniques · Data Quality and Management

Methods: Sparse Evolutionary Training