Sentence Identification with BOS and EOS Label Combinations

Takuma Udagawa; Hiroshi Kanayama; Issei Yoshida

arXiv:2301.13352·cs.CL·February 1, 2023

TL;DR

This paper introduces sentence identification, a task that separates sentential units (SUs) from non-sentential units (NSUs) in text. The proposed method combines BOS and EOS labels using dynamic programming and improves over EOS-only sentence segmentation.

Contribution

The paper proposes a new task, sentence identification, and a simple method that combines BOS and EOS labels. This addresses the assumption in existing segmentation approaches that the input text consists only of sentences.

Findings

01. The proposed method outperforms traditional EOS-only segmentation baselines.

02. A language-independent benchmark was created from Universal Dependencies corpora.

03. Dynamic programming effectively identifies sentential and non-sentential units.
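
The abstract shown on this page is truncated, so the exact scoring used in the paper is not visible here. As a rough illustration of how BOS and EOS labels could be combined with dynamic programming, the sketch below assumes a tagger has already produced, for each token, a probability of starting an SU (`bos`) and of ending an SU (`eos`). It then picks the set of non-overlapping spans that maximizes the summed log-odds of the chosen BOS and EOS positions; tokens outside every span are treated as NSUs. The function names, the log-odds objective, and the example probabilities are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Optional, Tuple


def log_odds(p: float, eps: float = 1e-9) -> float:
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p) - math.log(1.0 - p)


def identify_sentences(bos: List[float], eos: List[float]) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) token spans, end inclusive.

    Hypothetical objective: each span (i, j) gains
    log_odds(bos[i]) + log_odds(eos[j]); uncovered tokens gain nothing.
    """
    n = len(bos)
    if len(eos) != n:
        raise ValueError("bos and eos must have the same length")
    b = [log_odds(p) for p in bos]
    e = [log_odds(p) for p in eos]

    # best[k]: best total gain using only tokens 0..k-1.
    # back[k]: start of the span ending at token k-1, or None if
    #          token k-1 does not end a span in the best solution.
    best = [0.0] * (n + 1)
    back: List[Optional[int]] = [None] * (n + 1)
    for k in range(1, n + 1):
        best[k] = best[k - 1]          # option 1: token k-1 ends no span
        for i in range(k):             # option 2: a span (i, k-1) ends here
            cand = best[i] + b[i] + e[k - 1]
            if cand > best[k]:
                best[k] = cand
                back[k] = i

    # Walk the back-pointers from the end to recover the chosen spans.
    spans: List[Tuple[int, int]] = []
    k = n
    while k > 0:
        start = back[k]
        if start is None:
            k -= 1
        else:
            spans.append((start, k - 1))
            k = start
    spans.reverse()
    return spans


# Made-up probabilities for: ["Id:", "42", "Hello", "world", ".", "Bye", "."]
bos_probs = [0.10, 0.05, 0.90, 0.05, 0.02, 0.80, 0.05]
eos_probs = [0.05, 0.20, 0.02, 0.05, 0.95, 0.03, 0.90]
print(identify_sentences(bos_probs, eos_probs))
```

With these made-up scores, the leading "Id: 42" tokens are left outside every span, which is the behavior an EOS-only segmenter cannot express: it would have to attach them to the first sentence. The double loop is quadratic in the number of tokens; a maximum span length would be a natural way to bound it.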

Abstract

The sentence is a fundamental unit in many NLP applications. Sentence segmentation is widely used as the first preprocessing task, where an input text is split into consecutive sentences considering the end of the sentence (EOS) as their boundaries. This task formulation relies on a strong assumption that the input text consists only of sentences, or what we call the sentential units (SUs). However, real-world texts often contain non-sentential units (NSUs) such as metadata, sentence fragments, nonlinguistic markers, etc. which are unreasonable or undesirable to be treated as a part of an SU. To tackle this issue, we formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text. To conduct sentence identification, we propose a simple yet effective method which combines the beginning of the sentence (BOS) and EOS labels to…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification