MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
Tobias Brugger, Matthias Stürmer, Joel Niklaus

TL;DR
This paper introduces a multilingual legal dataset for sentence boundary detection with over 130,000 annotated sentences in six languages, and shows that models trained on it outperform existing SBD models, including in a zero-shot setting on a Portuguese test set.
Contribution
The creation of a diverse legal SBD dataset in six languages and the development of models that achieve state-of-the-art results, including in zero-shot settings.
Findings
Existing models perform poorly on multilingual legal data.
Multilingual models outperform baselines in zero-shot Portuguese SBD.
CRF, BiLSTM-CRF, and transformer models trained on the dataset achieve state-of-the-art results.
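To make the difficulty concrete, here is a minimal sketch of how a purely rule-based splitter can go wrong on legal-style text. It is not the paper's method, nor one of the baselines it evaluates. The function name `naive_split` and the sample sentence are illustrative assumptions, not material from the dataset.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace.

    A deliberately naive rule for illustration only; it knows nothing
    about abbreviations such as "Art." or "para." common in legal text.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical legal-style input containing two actual sentences.
sample = "The claim fails under Art. 5 para. 2 of the Act. The appeal is dismissed."

for part in naive_split(sample):
    print(part)
```

The sample contains two sentences, but the rule splits it into four pieces because it treats the periods in "Art." and "para." as sentence ends. Errors of this kind are one reason models trained on annotated legal text can be expected to do better.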
Abstract
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and different sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130'000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by…
Taxonomy
Topics: Natural Language Processing Techniques · Artificial Intelligence in Law · Topic Modeling
Methods: Test · Conditional Random Field
