# Text segmentation on multilabel documents: A distant-supervised approach

**Authors:** Saurav Manchanda, George Karypis

arXiv: 1904.06730 · 2019-04-16

## TL;DR

This paper introduces a distant-supervised method for text segmentation that trains on document-level labels instead of segment-level annotations. It beats competing approaches on four of five segmentation datasets, performs on par on the fifth, and takes significantly less time to estimate.

## Contribution

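The approach trains on document-level multilabel information rather than segment-level ground truth, which lowers annotation cost. It improves on previous approaches by exploiting the tendency of consecutive sentences to discuss the same topic and therefore belong to the same class.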

## Key findings

- Beats competing approaches on four of five text segmentation datasets and performs on par on the fifth
- Performs on par with competing approaches on multilabel text classification
- Takes significantly less time to estimate than competing approaches

## Abstract

Segmenting text into semantically coherent segments is an important task with applications in information retrieval and text summarization. Developing accurate topical segmentation requires the availability of training data with ground truth information at the segment level. However, generating such labeled datasets, especially for applications in which the meaning of the labels is user-defined, is expensive and time-consuming. In this paper, we develop an approach that, instead of using segment-level ground truth information, uses the set of labels associated with a document, which are easier to obtain because the training data essentially corresponds to a multilabel dataset. Our method, which can be thought of as an instance of distant supervision, improves upon previous approaches by exploiting the fact that consecutive sentences in a document tend to talk about the same topic and hence probably belong to the same class. Experiments on the text segmentation task on a variety of datasets show that the segmentation produced by our method beats the competing approaches on four out of five datasets and performs on par on the fifth. On the multilabel text classification task, our method performs on par with the competing approaches while requiring significantly less time to estimate.
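To make the contiguity intuition concrete, here is a minimal sketch of how a preference for consecutive sentences sharing a class can turn per-sentence scores into segments. It is not the authors' model: the function name `smooth_segment`, the `switch_penalty` parameter, and the assumption that some upstream classifier already supplies a score per sentence and class are all hypothetical, chosen only for illustration. The sketch uses a standard Viterbi-style dynamic program that maximises the total score minus a fixed penalty for every label change.

```python
from typing import List, Sequence


def smooth_segment(scores: Sequence[Sequence[float]],
                   switch_penalty: float = 1.0) -> List[int]:
    """Pick one class per sentence, maximising the summed scores minus
    `switch_penalty` for each change of class between neighbours.

    `scores[i][k]` is a hypothetical score of class k for sentence i.
    `switch_penalty` is assumed to be non-negative.
    """
    if not scores:
        return []
    n_classes = len(scores[0])
    best = list(scores[0])          # best total ending in each class
    back: List[List[int]] = []      # back-pointers, one row per later sentence

    for row in scores[1:]:
        # Best class to switch away from is the overall best so far.
        top = max(range(n_classes), key=lambda k: best[k])
        pointers, new_best = [], []
        for k in range(n_classes):
            stay = best[k]
            switch = best[top] - switch_penalty
            if stay >= switch:
                pointers.append(k)
                new_best.append(stay + row[k])
            else:
                pointers.append(top)
                new_best.append(switch + row[k])
        back.append(pointers)
        best = new_best

    # Trace the best path backwards from the final sentence.
    label = max(range(n_classes), key=lambda k: best[k])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path


# Made-up scores for five sentences and two classes.
toy_scores = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
smoothed = smooth_segment(toy_scores, switch_penalty=1.0)
unsmoothed = smooth_segment(toy_scores, switch_penalty=0.0)
```

With a penalty of zero the result is the per-sentence argmax, which flips class on the weakly scored second sentence. A positive penalty absorbs that sentence into the surrounding segment, so segment boundaries appear only where the labels change with enough supporting evidence.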

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.06730/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1904.06730/full.md

## References

14 references, with the full list in the complete paper: https://tomesphere.com/paper/1904.06730/full.md

---
Source: https://tomesphere.com/paper/1904.06730