Structural Text Segmentation of Legal Documents
Dennis Aumiller, Satya Almasian, Sebastian Lackner, Michael Gertz

TL;DR
This paper introduces a transformer-based system for segmenting legal documents by detecting topical changes, improving the coherence and utility of document representations for legal information retrieval.
Contribution
It presents a novel topical change detection approach for legal document segmentation using transformer models and a large annotated dataset of Terms-of-Service documents.
Findings
Significantly outperforms baseline segmentation methods
Effectively adapts to legal document structures
Provides publicly available data and models
Abstract
The growing complexity of legal cases has led to an increasing interest in legal information retrieval systems that can effectively satisfy user-specific information needs. However, such downstream systems typically require documents to be properly formatted and segmented, which is often done with relatively simple pre-processing steps, disregarding topical coherence of segments. Systems generally rely on representations of individual sentences or paragraphs, which may lack crucial context, or document-level representations, which are too long for meaningful search results. To address this issue, we propose a segmentation system that can predict topical coherence of sequential text segments spanning several paragraphs, effectively segmenting a document and providing a more balanced representation for downstream applications. We build our model on top of popular transformer networks and…
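The segmentation idea in the abstract, deciding for each pair of consecutive paragraphs whether they stay on the same topic and cutting where they do not, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold, and the token-overlap scorer are all hypothetical stand-ins chosen so the example runs without any model. In the paper's setting, the scorer would be a transformer-based classifier.

```python
from typing import Callable, List


def jaccard_coherence(a: str, b: str) -> float:
    """Hypothetical stand-in scorer: token overlap between two paragraphs.

    A trained same-topic classifier would replace this function.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def segment(
    paragraphs: List[str],
    same_topic_score: Callable[[str, str], float] = jaccard_coherence,
    threshold: float = 0.2,  # assumed cut-off, not taken from the paper
) -> List[List[str]]:
    """Group consecutive paragraphs, starting a new segment whenever the
    score for a neighbouring pair falls below the threshold."""
    if not paragraphs:
        return []
    segments = [[paragraphs[0]]]
    for previous, current in zip(paragraphs, paragraphs[1:]):
        if same_topic_score(previous, current) >= threshold:
            segments[-1].append(current)   # same topic: extend the segment
        else:
            segments.append([current])     # topical change: open a new one
    return segments


if __name__ == "__main__":
    doc = [
        "we collect your personal data",
        "your personal data is stored securely",
        "refunds are issued within thirty days",
    ]
    for i, seg in enumerate(segment(doc)):
        print(i, seg)
```

Passing the scorer as a parameter keeps the boundary logic independent of how coherence is estimated, so a learned model can be swapped in without touching the loop.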
