Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, Markus Schedl

TL;DR
The paper introduces SaT, a universal, efficient, and adaptable sentence segmentation model that outperforms existing methods across diverse domains and languages, especially with poorly formatted text, by reducing punctuation reliance and enhancing domain adaptability.
Contribution
We propose SaT, a novel sentence segmentation model with a new pretraining scheme and architectural improvements, achieving robustness, adaptability, and efficiency in diverse text domains.
Findings
Outperforms all baselines across 8 diverse corpora
Achieves threefold speed improvement over previous state-of-the-art
Effectively handles poorly formatted and multilingual text
Abstract
Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed…
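To make the abstract's point about punctuation reliance concrete, here is a minimal sketch of a rule-based splitter. It is not taken from the paper: the function `naive_split` and the sample strings are hypothetical, chosen only to show how a punctuation-driven rule behaves when punctuation is missing.

```python
import re

# Hypothetical baseline (an assumption for illustration, not the paper's method):
# treat whitespace that follows '.', '!' or '?' as a sentence boundary.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> list[str]:
    """Split text at whitespace that follows sentence-final punctuation."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


# Well-formatted input: the rule finds every boundary.
clean = "We went home. It was late. Nobody spoke."

# Same content without punctuation or casing: the rule has nothing to match on.
noisy = "we went home it was late nobody spoke"

print(naive_split(clean))
print(naive_split(noisy))
```

On the well-formatted string the rule recovers three sentences, while the unpunctuated string comes back as a single segment. The same rule also splits after abbreviations such as "Dr.", which is one reason purely lexical rules tend to need per-domain tuning.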
Code & Models
- 🤗 igorsterner/xlmr-multilingual-sentence-segmentation (model) · 254 dl · ♡ 5
- 🤗 segment-any-text/sat-3l (model) · 7.6k dl · ♡ 4
- 🤗 segment-any-text/sat-12l (model) · 5.4k dl · ♡ 7
- 🤗 segment-any-text/sat-12l-sm (model) · 283k dl · ♡ 28
- 🤗 segment-any-text/sat-3l-sm (model) · 231k dl · ♡ 10
- 🤗 segment-any-text/sat-6l (model) · 679 dl
- 🤗 segment-any-text/sat-9l (model) · 198 dl
- 🤗 segment-any-text/sat-1l (model) · 110 dl · ♡ 1
- 🤗 segment-any-text/sat-9l-sm (model) · 8 dl
- 🤗 segment-any-text/sat-6l-sm (model) · 3.5k dl · ♡ 3
Taxonomy
Topics: Natural Language Processing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
