Is there something I'm missing? Topic Modeling in eDiscovery
Herbert L. Roitblat

TL;DR
This study shows that in legal eDiscovery, a search that retrieves only part of the relevant documents can still surface all of the relevant topics, supporting the idea that a search can be efficient and still complete in terms of topics.
Contribution
Using topic modeling and machine learning classifiers, the paper shows that partial document retrieval in eDiscovery still captures all relevant topics, challenging the need for 100% Recall.
Findings
A hit set with less than full document Recall still covers all topics.
Naive Bayes and SVM classifiers both find all topics in the hit set.
Topic coverage remains complete despite missing relevant documents.
Abstract
In legal eDiscovery, the parties are required to search through their electronically stored information to find documents that are relevant to a specific case. Negotiations over the scope of these searches are often based on a fear that something will be missed. This paper continues an argument that discovery should be based on identifying the facts of a case. If a search process is less than complete (if it has Recall less than 100%), it may still be complete in presenting all of the relevant available topics. In this study, Latent Dirichlet Allocation was used to identify 100 topics from all of the known relevant documents. The documents were then categorized to about 80% Recall (i.e., 80% of the relevant documents were found by the categorizer, designated the hit set, and 20% were missed, designated the missed set). Despite the fact that less than all of the relevant documents were…
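The comparison between the hit set and the missed set can be illustrated with a minimal sketch. It is not the paper's code. It assumes each relevant document has already been assigned a single dominant topic (for example, the highest-weight topic from a fitted topic model); the paper's actual assignment rule is not stated here. The names `recall`, `topic_coverage`, `doc_topics`, and `hit_ids` are hypothetical, and the ten-document data set is a toy example, not data from the study.

```python
def recall(hit_ids, relevant_ids):
    """Fraction of the relevant documents that the categorizer found."""
    relevant = set(relevant_ids)
    return len(relevant & set(hit_ids)) / len(relevant)


def topic_coverage(doc_topics, hit_ids):
    """Compare topics present in the hit set with topics in the missed set.

    doc_topics: dict mapping each relevant document id to one topic id
                (assumption: a single dominant topic per document).
    hit_ids:    ids of the relevant documents the categorizer retrieved.
    Returns (hit_topics, missed_topics, uncovered), where `uncovered`
    holds topics that appear only in the missed documents.
    """
    hit = set(hit_ids)
    hit_topics = {t for d, t in doc_topics.items() if d in hit}
    missed_topics = {t for d, t in doc_topics.items() if d not in hit}
    return hit_topics, missed_topics, missed_topics - hit_topics


# Toy example: 10 relevant documents spread over 3 topics.
doc_topics = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2, 6: 0, 7: 1, 8: 2, 9: 0}
hit_ids = range(8)  # documents 8 and 9 form the missed set

print("Recall:", recall(hit_ids, doc_topics))
hit_topics, missed_topics, uncovered = topic_coverage(doc_topics, hit_ids)
print("Topics only in the missed set:", uncovered)
```

In this toy case the two missed documents carry topics that also occur in the hit set, so `uncovered` is empty even though Recall is below 100%. A non-empty `uncovered` set would mean the search missed a topic entirely.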
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Advanced Graph Neural Networks
