TL;DR
PySBD is a rule-based Python package for sentence boundary disambiguation that works out-of-the-box for 22 languages and aims to produce logical sentences even when the format and domain of the input text are unknown.
Contribution
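The paper introduces a rule-based, multilingual sentence boundary disambiguation tool in Python, ported from the Ruby gem pragmatic_segmenter with additional improvements and functionality, that passes more Golden Rule Set exemplars than other open-source Python tools.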
Findings
Passes 97.92% of Golden Rule Set exemplars for English
Improves on the next best open-source Python tool by 25% on the same English exemplars
Supports 22 languages out-of-the-box
Abstract
In this paper, we present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text are unknown. In our work, we adapt the Golden Rule Set (a language-specific set of sentence boundary exemplars), originally implemented as a Ruby gem - pragmatic_segmenter - which we ported to Python with additional improvements and functionality. PySBD passes 97.92% of the Golden Rule Set exemplars for English, an improvement of 25% over the next best open-source Python tool.
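To make the rule-based approach and the exemplar-based evaluation concrete, here is a minimal sketch using only the Python standard library. It is not PySBD's implementation: the abbreviation list, the single-initial rule, the names `segment`, `pass_rate`, and `EXEMPLARS`, and the four test sentences are all illustrative assumptions written in the style of the Golden Rule Set. The actual package applies a much larger, language-specific rule set.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (illustration only).
ABBREVIATIONS = {"mr", "mrs", "dr", "p", "vs"}

# Candidate boundary: a run of ., ! or ? followed by whitespace.
CANDIDATE = re.compile(r"[.!?]+\s+")


def segment(text):
    """Split text into sentences with two toy rules:
    a period after a known abbreviation or after a single-letter
    initial is not treated as a sentence boundary."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        punct = match.group().strip()
        is_initial = len(last_word) == 1 and last_word.isalpha()
        if punct == "." and (last_word in ABBREVIATIONS or is_initial):
            continue  # rule fired: not a real boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Assumed exemplars (input, expected sentences); not the real rule set.
EXEMPLARS = [
    ("Hello World. My name is Jonas.",
     ["Hello World.", "My name is Jonas."]),
    ("What is your name? My name is Jonas.",
     ["What is your name?", "My name is Jonas."]),
    ("My name is Jonas E. Smith. Please turn to p. 55.",
     ["My name is Jonas E. Smith.", "Please turn to p. 55."]),
    ("She has $5.50. He has none.",
     ["She has $5.50.", "He has none."]),
]


def pass_rate(exemplars):
    """Fraction of exemplars whose segmentation matches exactly."""
    passed = sum(1 for text, expected in exemplars
                 if segment(text) == expected)
    return passed / len(exemplars)


if __name__ == "__main__":
    print(f"pass rate: {pass_rate(EXEMPLARS):.2%}")
```

The reported 97.92% figure is this kind of exact-match pass rate, computed over the English Golden Rule Set rather than over a handful of toy cases. A segmenter that simply splits on every period would fail the third exemplar, which is why abbreviation and initial handling is expressed as explicit rules.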
