Automating Document Classification with Distant Supervision to Increase the Efficiency of Systematic Reviews
Xiaoxiao Li, Rabah Al-Zaidy, Amy Zhang, Stefan Baral, Le Bao, C. Lee Giles

TL;DR
This paper presents an automated document classification method using distant supervision and machine learning to streamline systematic reviews, reducing manual effort while maintaining high accuracy.
Contribution
It introduces a novel combination of classifiers, including a random forest approach, for efficient and accurate document classification in systematic reviews.
Findings
Random forest achieved the highest AUC in both ROC and PR analyses.
With the classifier, reviewers can read only 20% of the articles while still capturing 80% of the relevant cases.
A good classifier can be trained with a relatively small dataset.
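The "review 20%, capture 80%" finding is a statement about recall when documents are read in order of classifier score. A minimal sketch of how such a figure could be computed is given here; the function name `recall_at_fraction` and the toy scores and labels are invented for illustration and are not taken from the paper.

```python
def recall_at_fraction(scores, labels, fraction):
    """Share of relevant documents found in the top `fraction` of a score ranking.

    scores:   classifier scores, higher means more likely relevant
    labels:   1 for relevant, 0 for not relevant (same order as scores)
    fraction: portion of the collection a reviewer reads, in (0, 1]
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("labels contain no relevant documents")
    # Rank document indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Number of documents the reviewer reads (at least one).
    n_review = max(1, round(fraction * len(scores)))
    found = sum(labels[i] for i in ranked[:n_review])
    return found / total_relevant


# Toy data, hypothetical: 10 documents, 5 of them relevant.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, recall_at_fraction(scores, labels, fraction))
```

On real data, the reported result would correspond to `recall_at_fraction(scores, labels, 0.2)` being about 0.8; the toy ranking above is deliberately weaker than that.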
Abstract
Objective: Systematic reviews of scholarly documents often provide complete and exhaustive summaries of literature relevant to a research question. However, well-done systematic reviews are expensive, time-demanding, and labor-intensive. Here, we propose an automatic document classification approach to significantly reduce the effort in reviewing documents. Methods: We first describe a manual document classification procedure that is used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a cluster analysis-based refined method, and a random forest approach that utilizes a large set of feature tokens. As an example, this approach is used to identify documents studying female sex workers that are assumed to contain content relevant to either HIV or violence. We compare the performance of the three classifiers by cross-validation and…
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Advanced Text Analysis Techniques
