The Effect of Document Selection on Query-focused Text Analysis

Sandesh S Rangreji; Mian Zhong; Anjalie Field

arXiv:2604.12099·cs.IR·April 15, 2026

TL;DR

This paper systematically evaluates how seven document selection strategies affect the outputs of four text analysis methods (LDA, BERTopic, TopicGPT, HiCode), offers practical guidance on which strategies to use, and establishes data selection as a key methodological choice.

Contribution

It introduces a comprehensive evaluation framework for document selection strategies in query-focused text analysis, highlighting the effectiveness of semantic and hybrid retrieval methods.

Findings

01

Semantic and hybrid retrieval are strong default choices for document selection.

02

Weaker selection strategies degrade analysis outputs, while more complicated ones add unnecessary compute overhead.

03

The framework encourages viewing data selection as a core methodological decision.

Abstract

Analyses of document collections often require selecting what data to analyze, as not all documents are relevant to a particular research question and computational constraints preclude analyzing all documents, yet little work has examined the effects of selection strategy choices. We systematically evaluate seven selection methods (from random selection to hybrid retrieval) on outputs from four text analysis methods (LDA, BERTopic, TopicGPT, HiCode) over two datasets with 26 open-ended queries. Our evaluation yields practical guidance: semantic and hybrid retrieval offer strong go-to approaches that avoid the pitfalls of weaker selection strategies and the unnecessary compute overhead of more complicated ones. Overall, our evaluation framework establishes data selection as a methodological decision, rather than a practical necessity, inviting the development of new strategies.
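
As a rough illustration of what a hybrid selection strategy can look like, the sketch below fuses a lexical ranking and a semantic ranking with reciprocal rank fusion and keeps the top-ranked documents. This is not the paper's implementation: the abstract does not say how its hybrid retrieval combines signals, so the fusion rule, the constant k = 60, and the function names (`rank_by_score`, `reciprocal_rank_fusion`, `select_documents`) are assumptions made for this example. The per-document scores are presumed to be computed elsewhere, for instance by a BM25 scorer and an embedding model.

```python
from typing import Dict, List, Sequence


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Return document ids ordered from highest to lowest score.

    Ties are broken by document id so the ordering is deterministic.
    """
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def reciprocal_rank_fusion(
    rankings: Sequence[List[str]], k: int = 60
) -> Dict[str, float]:
    """Fuse several rankings: each document gets sum(1 / (k + rank)).

    k = 60 is a conventional default for this fusion rule, assumed here;
    it is not a value taken from the paper.
    """
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def select_documents(
    lexical_scores: Dict[str, float],
    semantic_scores: Dict[str, float],
    top_n: int,
    k: int = 60,
) -> List[str]:
    """Hypothetical hybrid selector: fuse both rankings, keep the top_n ids."""
    fused = reciprocal_rank_fusion(
        [rank_by_score(lexical_scores), rank_by_score(semantic_scores)], k=k
    )
    return rank_by_score(fused)[:top_n]


if __name__ == "__main__":
    # Toy query-document scores, invented purely for illustration.
    lexical = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
    semantic = {"a": 0.1, "b": 0.9, "c": 0.8, "d": 0.2}
    print(select_documents(lexical, semantic, top_n=2))
```

Rank-based fusion is used here because it sidesteps the question of how to put lexical and embedding scores on a common scale; a weighted sum of normalized scores would be an equally plausible alternative.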
