ClusterChat: Multi-Feature Search for Corpus Exploration

Ashish Chouhan; Saifeldin Mandour; Michael Gertz

arXiv:2412.14533·cs.CL·June 18, 2025

TL;DR

ClusterChat is an open-source system that combines clustering, multi-feature search, and question answering to facilitate large-scale corpus exploration, demonstrated on millions of biomedical abstracts.

Contribution

We introduce ClusterChat, a novel system integrating clustering, multi-feature search, and QA for effective large-scale corpus exploration.

Findings

01. Enhances corpus exploration with context-aware insights.

02. Maintains scalability and responsiveness on large datasets.

03. Validated on four million PubMed abstracts.

Abstract

Exploring large-scale text corpora presents a significant challenge in the biomedical, finance, and legal domains, where vast numbers of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user's ability to inspect corpus-wide trends and relationships. We present ClusterChat, an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus- and document-level question answering (QA) as multi-feature search capabilities. The demo video and source code are available at https://github.com/achouhan93/ClusterChat. We validate the system with two case studies on a PubMed dataset of four million abstracts, demonstrating that ClusterChat enhances corpus exploration by…
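To make the idea of multi-feature search concrete, here is a minimal sketch of how lexical search, embedding-based search, and a timeline filter can operate over the same document set. It is not ClusterChat's implementation: the toy corpus, the function names (`embed`, `lexical_search`, `semantic_search`, `timeline_filter`), and the bag-of-words vector used in place of a real text embedding are all assumptions made for illustration, and clustering and QA are omitted.

```python
from collections import Counter
from datetime import date
from math import sqrt

# Toy corpus; ids, dates, and texts are invented for illustration.
DOCS = [
    {"id": 1, "date": date(2019, 5, 1), "text": "gene expression in breast cancer tumors"},
    {"id": 2, "date": date(2021, 3, 9), "text": "deep learning for tumor image segmentation"},
    {"id": 3, "date": date(2022, 11, 20), "text": "vaccine response in elderly patients"},
]

def embed(text):
    """Stand-in for a learned text embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def lexical_search(docs, keyword):
    """Return documents containing the keyword as an exact token."""
    kw = keyword.lower()
    return [d for d in docs if kw in d["text"].lower().split()]

def semantic_search(docs, query, top_k=2):
    """Rank documents by vector similarity to the query and keep the top_k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:top_k]

def timeline_filter(docs, start, end):
    """Keep documents published within the inclusive date range."""
    return [d for d in docs if start <= d["date"] <= end]
```

The three functions take and return the same list-of-documents shape, so they can be chained, for example `semantic_search(timeline_filter(DOCS, date(2020, 1, 1), date(2022, 12, 31)), "tumor segmentation")`. A real system would replace `embed` with a trained embedding model and an index suited to millions of documents.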

Code & Models

Repositories

achouhan93/clustertalk
none · Official

Taxonomy

Topics: Natural Language Processing Techniques