A Comparison of Approaches for Imbalanced Classification Problems in the Context of Retrieving Relevant Documents for an Analysis
Sandra Wankmüller

TL;DR
This study compares various methods for retrieving relevant documents in social science research, finding that active supervised learning significantly outperforms simple keyword lists, while other complex methods often do not improve performance.
Contribution
It provides an empirical comparison of keyword-based, query expansion, topic modeling, and supervised learning approaches for document retrieval in social science contexts.
Findings
Given sufficient training data, active supervised learning outperforms keyword lists.
Query expansion and topic models often decrease retrieval performance.
Supervised learning with around 1,000 labeled documents yields substantial improvements.
Abstract
One of the first steps in many text-based social science studies is to retrieve documents that are relevant for the analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists risks drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is…
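The keyword rule described in the abstract (a document counts as relevant if it contains at least one keyword) can be illustrated with a minimal sketch. This is not the paper's code: the function names `keyword_retrieve` and `precision_recall_f1`, the toy documents, the relevance labels, and the whole-word, case-insensitive matching are all assumptions made for illustration.

```python
import re

def keyword_retrieve(docs, keywords):
    """Return indices of documents that contain at least one keyword.

    Matching is case-insensitive and on whole words (an assumption;
    a study could equally use stems or substrings).
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [i for i, doc in enumerate(docs) if pattern.search(doc)]

def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 from two collections of indices."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical toy corpus; indices 0, 2, and 4 are labeled relevant by hand.
docs = [
    "Parliament debates new immigration law",
    "Local bakery wins award",
    "Asylum seekers arrive at the border",
    "Football club signs striker",
    "Migration figures published",
]
relevant = [0, 2, 4]

# An incomplete keyword list: it misses the term "migration".
keywords = ["immigration", "asylum"]
retrieved = keyword_retrieve(docs, keywords)
precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(retrieved, precision, recall, f1)
```

In this toy case the incomplete list retrieves only relevant documents but misses one of the three, which is the kind of gap the abstract warns can lead to biased inferences.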
Taxonomy
Topics: Computational and Text Analysis Methods · Text and Document Classification Technologies
