The Effect of Document Selection on Query-focused Text Analysis
Sandesh S Rangreji, Mian Zhong, Anjalie Field

TL;DR
This paper systematically evaluates how different document selection strategies impact the results of various text analysis methods, providing guidance on effective approaches and establishing data selection as a key methodological choice.
Contribution
The paper introduces an evaluation framework for document selection strategies in query-focused text analysis and uses it to show that semantic and hybrid retrieval are effective default choices.
Findings
Semantic and hybrid retrieval methods perform best for document selection.
Weaker selection strategies can lead to suboptimal analysis results.
The framework encourages viewing data selection as a core methodological decision.
Abstract
Analyses of document collections often require selecting what data to analyze, as not all documents are relevant to a particular research question and computational constraints preclude analyzing all documents, yet little work has examined the effects of selection strategy choices. We systematically evaluate seven selection methods (from random selection to hybrid retrieval) on outputs from four text analysis methods (LDA, BERTopic, TopicGPT, HiCode) over two datasets with 26 open-ended queries. Our evaluation yields practical guidance: semantic or hybrid retrieval offer strong go-to approaches that avoid the pitfalls of weaker selection strategies and the unnecessary compute overhead of more complicated ones. Overall, our evaluation framework establishes data selection as a methodological decision, rather than a practical necessity, inviting the development of new strategies.
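To make the two ends of the selection spectrum concrete, here is a minimal sketch of a random baseline and one possible hybrid selector. The abstract does not say how the paper's hybrid method combines lexical and semantic signals, so the use of reciprocal rank fusion, the function names, and the constant k=60 are all assumptions made for illustration, not the authors' implementation. The input rankings are assumed to come from some lexical retriever and some embedding-based retriever that have already been run for the query.

```python
import random
from collections import defaultdict


def random_selection(doc_ids, n, seed=0):
    """Baseline: pick n documents uniformly at random, ignoring the query."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), n)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in.
    This is an assumed fusion rule, not taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_selection(lexical_ranking, semantic_ranking, n, k=60):
    """Hypothetical hybrid selector: fuse both rankings and keep the top n."""
    fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking], k=k)
    return fused[:n]


if __name__ == "__main__":
    lexical = ["d1", "d2", "d3", "d4"]   # e.g. from a keyword-based retriever
    semantic = ["d3", "d1", "d5", "d2"]  # e.g. from an embedding-based retriever
    print(hybrid_selection(lexical, semantic, n=3))
    print(random_selection(["d1", "d2", "d3", "d4", "d5"], n=3))
```

The selected subset would then be passed to the downstream analysis method (for example a topic model), which is the step whose sensitivity to selection the paper evaluates.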
