Empirical Evaluations of Seed Set Selection Strategies for Predictive Coding
Christian J. Mahoney, Nathaniel Huber-Fliflet, Katie Jensen, Haozhen Zhao, Robert Neary, Shi Ye

TL;DR
This paper empirically evaluates various seed set selection strategies in predictive coding for legal document review, demonstrating their significant impact on model precision and guiding attorneys to improve predictive modeling efficiency.
Contribution
The paper sets out to identify seed set selection strategies that consistently perform well in predictive coding, addressing a gap in the research and offering practical guidance for legal document review.
Findings
Seed set selection significantly affects model precision.
Eight strategies were evaluated across four legal cases.
Results enable attorneys to optimize predictive coding processes.
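To make the terms in the findings above concrete, here is a minimal Python sketch of what a seed set selection strategy and a precision measurement can look like. It is an illustration only: the two strategies (`random_seed_set`, `keyword_seed_set`), the toy documents, and the keywords are hypothetical and are not taken from the paper, whose eight strategies are not described in this summary.

```python
import random


def random_seed_set(doc_ids, k, seed=0):
    """Pick k documents uniformly at random (hypothetical baseline strategy)."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(doc_ids), k))


def keyword_seed_set(docs, keywords, k):
    """Pick up to k documents with the most keyword hits (hypothetical strategy)."""
    def hits(text):
        words = text.lower().split()
        return sum(words.count(kw) for kw in keywords)

    scored = [(hits(text), doc_id) for doc_id, text in docs.items()]
    # Highest hit count first; ties broken by document id for determinism.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked[:k] if score > 0]


def precision(predicted_responsive, truly_responsive):
    """Fraction of documents predicted responsive that really are responsive."""
    predicted = set(predicted_responsive)
    if not predicted:
        return 0.0
    return len(predicted & set(truly_responsive)) / len(predicted)


# Toy collection (invented for illustration): id -> text.
docs = {
    1: "merger agreement draft attached",
    2: "lunch on friday",
    3: "revised merger terms and agreement",
    4: "parking pass renewal",
}

print(keyword_seed_set(docs, ["merger", "agreement"], 2))
print(precision([1, 2, 3], [1, 3]))
```

In a real review, the seed set chosen by either function would be coded by an attorney and used to train the initial model; precision would then be measured on that model's predictions over held-out documents.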
Abstract
Training documents have a significant impact on the performance of predictive models in the legal domain. Yet, there is limited research that explores the effectiveness of the training document selection strategy - in particular, the strategy used to select the seed set, or the set of documents an attorney reviews first to establish an initial model. Since there is limited research on this important component of predictive coding, the authors of this paper set out to identify strategies that consistently perform well. Our research demonstrated that the seed set selection strategy can have a significant impact on the precision of a predictive model. Enabling attorneys with the results of this study will allow them to initiate the most effective predictive modeling process to comb through the terabytes of data typically present in modern litigation. This study used documents from four…
Taxonomy
Topics: Imbalanced Data Classification Techniques · Computational and Text Analysis Methods · Machine Learning and Data Classification
