Chunking German Legal Code
Max Prior, Natalia Milanova, Andreas Schultz

TL;DR
This study evaluates chunking strategies for retrieval over German statutory law and finds that structure-aligned methods outperform more complex semantic approaches in both recall and efficiency.
Contribution
It systematically compares multiple chunking approaches for legal retrieval, emphasizing the importance of domain-specific structure preservation.
Findings
Structure-aligned chunking (section- and subsection-based) achieves the highest recall.
Simpler methods are more computationally efficient.
Complex semantic methods underperform compared to structural approaches.
Abstract
This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure, particularly section- and subsection-based retrieval, achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational…
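As a rough illustration of what section-aligned chunking and section-level recall could look like, here is a minimal Python sketch. It is not the authors' implementation: the regular expression for section headers, the function names `split_sections` and `recall_at_k`, and the toy input text are all assumptions made for this example.

```python
import re

# Assumption: each section starts on its own line with "§ <number>[letter]",
# e.g. "§ 1" or "§ 312a". Real statute files may need a different pattern.
SECTION_RE = re.compile(r"^§\s*(\d+[a-z]?)\b", re.MULTILINE)


def split_sections(text):
    """Split statute text into (section_id, section_text) pairs in document order."""
    matches = list(SECTION_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section runs from its header to the next header (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1), text[m.start():end].strip()))
    return sections


def recall_at_k(retrieved_section_ids, gold_section_ids, k):
    """Fraction of gold sections found among the sections of the top-k retrieved chunks."""
    gold = set(gold_section_ids)
    if not gold:
        return 0.0
    top = set(retrieved_section_ids[:k])
    return len(top & gold) / len(gold)


# Toy placeholder text, not actual statute wording.
sample = (
    "§ 1 First heading\nText of the first section.\n"
    "§ 2 Second heading\n(1) First subsection.\n(2) Second subsection.\n"
    "§ 312a Third heading\nText of the third section.\n"
)
sections = split_sections(sample)
section_ids = [sid for sid, _ in sections]
```

Because each chunk carries its section identifier, any finer-grained chunk (subsection, sentence, fixed-size window) can be mapped back to a section before scoring, which is what makes section-level gold labels usable across chunking strategies.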
