Text Segmentation as a Supervised Learning Task

Omri Koshorek; Adir Cohen; Noam Mor; Michael Rotman; Jonathan Berant

arXiv:1803.09337·cs.CL·March 28, 2018

TL;DR

This paper reformulates text segmentation as a supervised learning task, introducing a large labeled dataset from Wikipedia and a model that generalizes well to unseen text.

Contribution

It presents the first large-scale supervised dataset for text segmentation and a new model that outperforms previous unsupervised methods.

Findings

01 The supervised model achieves better segmentation accuracy than unsupervised approaches.

02 The dataset enables effective training of segmentation models on natural language data.

03 The model generalizes well to unseen text segments.

Abstract

Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding. Previous work on text segmentation focused on unsupervised methods such as clustering or graph search, due to the paucity of labeled data. In this work, we formulate text segmentation as a supervised learning problem, and present a large new dataset for text segmentation that is automatically extracted and labeled from Wikipedia. Moreover, we develop a segmentation model based on this dataset and show that it generalizes well to unseen natural text.
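
One way to picture the supervised formulation is as per-sentence binary labeling: each sentence is marked 1 if it ends a segment and 0 otherwise. The sketch below is illustrative only and is not taken from the paper's code. It assumes that section divisions of an article serve as segment boundaries and that sentences are already split; the function names `label_sentences` and `labels_to_segments` are hypothetical.

```python
from typing import List, Tuple


def label_sentences(sections: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Flatten sections into one sentence list with boundary labels.

    Assumption: each inner list holds the sentences of one section,
    and a section boundary is treated as a segment boundary.
    Label 1 marks the last sentence of a segment, 0 any other sentence.
    """
    sentences: List[str] = []
    labels: List[int] = []
    for section in sections:
        if not section:  # skip empty sections, they carry no sentences
            continue
        for i, sentence in enumerate(section):
            sentences.append(sentence)
            labels.append(1 if i == len(section) - 1 else 0)
    return sentences, labels


def labels_to_segments(sentences: List[str], labels: List[int]) -> List[List[str]]:
    """Rebuild contiguous segments from per-sentence boundary labels."""
    segments: List[List[str]] = []
    current: List[str] = []
    for sentence, label in zip(sentences, labels):
        current.append(sentence)
        if label == 1:  # this sentence closes the current segment
            segments.append(current)
            current = []
    if current:  # trailing sentences without a closing label
        segments.append(current)
    return segments


if __name__ == "__main__":
    doc = [["A.", "B."], ["C."], ["D.", "E.", "F."]]
    sents, labels = label_sentences(doc)
    print(labels)
    print(labels_to_segments(sents, labels))
```

Under this framing, a trained model predicts the label sequence for an unseen document, and `labels_to_segments` turns those predictions back into contiguous segments.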

Code & Models

Models

dennlinger/bert-wiki-paragraphs (Hugging Face model, 25 downloads, 11 likes)
