Evaluation of Seed Set Selection Approaches and Active Learning Strategies in Predictive Coding
Christian J. Mahoney, Nathaniel Huber-Fliflet, Haozhen Zhao, Jianping Zhang, Peter Gronvall, Shi Ye

TL;DR
This study extensively evaluates seed set selection and active learning strategies in predictive coding for text classification, revealing that certain strategies can significantly improve efficiency and early convergence, especially in low-richness datasets.
Contribution
It provides a comprehensive experimental comparison of seed set and active learning strategies, highlighting their impacts on predictive coding efficiency and early stopping points.
Findings
Seed set selection has minor overall impact but is significant in low-richness datasets.
Uncertainty, random, and recall-based strategies can reach optimal performance earlier than continuous active learning.
Active learning strategies can improve efficiency and reduce review effort in legal predictive coding.
Abstract
Active learning is a popular methodology in text classification, known in the legal domain as "predictive coding", "Technology Assisted Review", or "TAR", because it can minimize the review effort required to build effective classifiers. In this study, we use extensive experimentation to examine the impact of popular seed set selection strategies in active learning within a predictive coding exercise, and we evaluate different active learning strategies against well-researched continuous active learning strategies to determine efficient training methods for classifying large populations quickly and precisely. We study how random sampling, keyword models, and clustering-based seed set selection strategies, combined with top-ranked, uncertain, random, recall-inspired, and hybrid active learning document selection strategies, affect the performance of…
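To make the document selection strategies named in the abstract concrete, here is a minimal Python sketch of three of them: top-ranked (the selection used in continuous active learning), uncertain, and random. The function name `select_batch`, its parameters, and the toy scores are hypothetical and chosen for illustration; they are not taken from the paper. The recall-inspired and hybrid strategies are left out because the summary does not define them.

```python
import random


def select_batch(scores, labeled, strategy, batch_size, seed=0):
    """Pick indices of unreviewed documents to label next.

    scores:   classifier relevance probabilities, one per document (assumed input).
    labeled:  set of indices that have already been reviewed.
    strategy: "top_ranked", "uncertain", or "random".
    """
    candidates = [i for i in range(len(scores)) if i not in labeled]
    if strategy == "top_ranked":
        # Review the highest-scoring documents first.
        ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    elif strategy == "uncertain":
        # Review documents whose scores are closest to 0.5 first.
        ranked = sorted(candidates, key=lambda i: abs(scores[i] - 0.5))
    elif strategy == "random":
        # Review a uniformly random sample of the unreviewed documents.
        ranked = candidates[:]
        random.Random(seed).shuffle(ranked)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:batch_size]


# Toy example: six documents, document 0 already reviewed as part of the seed set.
scores = [0.95, 0.10, 0.55, 0.48, 0.80, 0.30]
labeled = {0}
top = select_batch(scores, labeled, "top_ranked", 2)
unc = select_batch(scores, labeled, "uncertain", 2)
rnd = select_batch(scores, labeled, "random", 2)
```

In a full training loop, the selected batch would be reviewed, added to `labeled`, and the classifier retrained before the next call. The seed set plays the role of the initial `labeled` set.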
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Imbalanced Data Classification Techniques
