Sentence Identification with BOS and EOS Label Combinations
Takuma Udagawa, Hiroshi Kanayama, Issei Yoshida

TL;DR
This paper introduces sentence identification, a task that separates sentential units (SUs) from non-sentential units (NSUs) in text. The proposed method combines BOS and EOS labels and decodes them with dynamic programming, improving over EOS-only segmentation baselines.
Contribution
The paper proposes a new task, sentence identification, and a simple method that combines BOS and EOS labels. This addresses a limitation of existing segmentation approaches, which assume that the input text consists only of sentences.
Findings
The proposed method outperforms traditional EOS-only segmentation baselines.
A language-independent benchmark was created from Universal Dependencies corpora.
Dynamic programming effectively identifies sentential and non-sentential units.
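To make the BOS/EOS combination concrete, here is a minimal sketch of how dynamic programming could pick out sentential units from per-token scores. It is an illustration under stated assumptions, not the authors' implementation: the function name `identify_sentences`, the use of log-odds scores, and the rule that a span scores `bos[start] + eos[end]` are all hypothetical choices made for this example. Tokens that fall outside every selected span are treated as NSUs.

```python
from typing import List, Optional, Tuple


def identify_sentences(bos: List[float], eos: List[float]) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) token spans, end inclusive.

    Assumption (not from the paper): bos[i] and eos[i] are log-odds that
    token i begins / ends a sentence, and a span (i, j) scores
    bos[i] + eos[j]. Tokens outside all spans contribute 0.
    """
    n = len(bos)
    if len(eos) != n:
        raise ValueError("bos and eos must have the same length")

    # best[k]: best total score using only tokens [0, k).
    best = [0.0] * (n + 1)
    # back[k]: start index of a span ending at token k - 1, or None
    # if token k - 1 is left outside any span.
    back: List[Optional[int]] = [None] * (n + 1)

    # Running maximum of best[i] + bos[i] over candidate starts i.
    open_val = float("-inf")
    open_idx = -1

    for k in range(1, n + 1):
        j = k - 1  # token considered as a possible span end
        candidate = best[j] + bos[j]
        if candidate > open_val:
            open_val, open_idx = candidate, j

        skip = best[k - 1]         # leave token j outside any span
        take = open_val + eos[j]   # close the best open span at token j
        if take > skip:
            best[k], back[k] = take, open_idx
        else:
            best[k] = skip

    # Walk the backpointers from the end to recover the chosen spans.
    spans: List[Tuple[int, int]] = []
    k = n
    while k > 0:
        start = back[k]
        if start is None:
            k -= 1
        else:
            spans.append((start, k - 1))
            k = start
    spans.reverse()
    return spans


# Toy input: "Posted by admin Hello world . Bye ."
bos_scores = [-2.0, -3.0, -3.0, 4.0, -5.0, -5.0, 3.0, -4.0]
eos_scores = [-4.0, -4.0, -1.0, -5.0, -3.0, 5.0, -4.0, 4.0]
print(identify_sentences(bos_scores, eos_scores))  # [(3, 5), (6, 7)]
```

In the toy input, tokens 0 to 2 ("Posted by admin") receive no span and are therefore left as an NSU, which an EOS-only segmenter would instead have to attach to a neighbouring sentence. The running maximum keeps the decoding linear in the number of tokens.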
Abstract
The sentence is a fundamental unit in many NLP applications. Sentence segmentation is widely used as the first preprocessing task, where an input text is split into consecutive sentences considering the end of the sentence (EOS) as their boundaries. This task formulation relies on a strong assumption that the input text consists only of sentences, or what we call the sentential units (SUs). However, real-world texts often contain non-sentential units (NSUs) such as metadata, sentence fragments, nonlinguistic markers, etc. which are unreasonable or undesirable to be treated as a part of an SU. To tackle this issue, we formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text. To conduct sentence identification, we propose a simple yet effective method which combines the beginning of the sentence (BOS) and EOS labels to…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
