Simplify Your Law: Using Information Theory to Deduplicate Legal Documents
Corinna Coupette, Jyotsna Singh, Holger Spamann

TL;DR
This paper introduces Dupex, an information-theoretic algorithm inspired by software refactoring, to detect and eliminate duplicated phrases in legal texts, thereby improving their clarity and maintainability.
Contribution
The paper applies the Minimum Description Length principle to the deduplication of legal documents, adapting ideas from software refactoring to the simplification of legal text.
Findings
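Dupex identifies a set of duplicated phrases (patterns) that together best compress a given legal text.
Removing this redundancy is intended to make legal texts more comprehensible and maintainable.
Experiments on the Titles of the United States Code indicate that the algorithm works well in practice.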
Abstract
Textual redundancy is one of the main challenges to ensuring that legal texts remain comprehensible and maintainable. Drawing inspiration from the refactoring literature in software engineering, which has developed methods to expose and eliminate duplicated code, we introduce the duplicated phrase detection problem for legal texts and propose the Dupex algorithm to solve it. Leveraging the Minimum Description Length principle from information theory, Dupex identifies a set of duplicated phrases, called patterns, that together best compress a given input text. Through an extensive set of experiments on the Titles of the United States Code, we confirm that our algorithm works well in practice: Dupex will help you simplify your law.
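To make the compression idea in the abstract concrete, the following is a minimal sketch of scoring a duplicated phrase by how much it shortens a text's description. It is not the Dupex algorithm. The cost model is an assumption made for illustration: every token costs one unit, storing a pattern once costs its length, and each occurrence is replaced by a one-unit reference. The function names `count_nonoverlapping`, `savings`, and `best_pattern` are hypothetical.

```python
def count_nonoverlapping(tokens, phrase):
    """Count left-to-right, non-overlapping occurrences of phrase in tokens."""
    n, count, i = len(phrase), 0, 0
    while i <= len(tokens) - n:
        if tuple(tokens[i:i + n]) == phrase:
            count += 1
            i += n  # skip past this occurrence so matches do not overlap
        else:
            i += 1
    return count


def savings(phrase, k):
    """Units saved under the assumed toy cost model.

    Before: k occurrences of n tokens cost k * n.
    After: the pattern is stored once (n) plus k one-unit references.
    """
    n = len(phrase)
    return k * n - (n + k)


def best_pattern(tokens, min_len=2, max_len=6):
    """Return the single phrase with the largest positive savings, or (None, 0)."""
    candidates = set()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidates.add(tuple(tokens[i:i + n]))

    best, best_gain = None, 0
    for phrase in sorted(candidates):  # sorted for deterministic tie-breaking
        gain = savings(phrase, count_nonoverlapping(tokens, phrase))
        if gain > best_gain:
            best, best_gain = phrase, gain
    return best, best_gain


if __name__ == "__main__":
    text = (
        "the secretary of the interior shall report and "
        "the secretary of the interior shall review and "
        "the secretary of the interior may act"
    )
    phrase, gain = best_pattern(text.split())
    print(" ".join(phrase), gain)
```

In this toy model a phrase only pays off when it is both long enough and frequent enough to offset the cost of storing it once, which is the trade-off a description-length criterion captures. Selecting a whole set of patterns that jointly compress the text best, as the abstract describes, is a harder problem than picking the single best phrase shown here.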
Taxonomy
Topics: Software Engineering Research · Digital and Cyber Forensics · Web Application Security Vulnerabilities
