MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions

Christina Niklaus; Andre Freitas; Siegfried Handschuh

arXiv:1909.12131·cs.CL·September 27, 2019

TL;DR

MinWikiSplit is a corpus of 203K aligned pairs of complex source sentences and simplified target sentences, in which each source sentence is split into minimal propositions. It is intended for developing sentence splitting methods whose output is easier for downstream NLP applications to process.

Contribution

This paper presents a sentence splitting corpus built around minimal propositions. Unlike earlier text simplification corpora, which contain only a small number of split examples, every input sentence in this corpus is broken down into a set of minimal propositions.

Findings

01. Corpus contains 203,000 aligned sentence pairs.

02. Each sentence is broken into minimal, self-contained propositions.

03. Facilitates development of sentence splitting methods for NLP applications.

Abstract

We compiled a new sentence splitting corpus composed of 203K pairs of aligned complex source and simplified target sentences. In contrast to previously proposed text simplification corpora, which contain only a small number of split examples, we present a dataset in which each input sentence is broken down into a set of minimal propositions, i.e. a sequence of sound, self-contained utterances, each presenting a minimal semantic unit that cannot be further decomposed into meaningful propositions. The corpus is useful for developing sentence splitting approaches that learn to transform sentences with a complex linguistic structure into a fine-grained representation of short sentences with a simpler, more regular structure. Such sentences are easier for downstream applications to process, which facilitates and improves their performance.
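To make the pair structure described in the abstract concrete, the following is a minimal Python sketch of one aligned example. The class name `SplitPair`, its field names, the helper functions, and the example sentences are all hypothetical illustrations; they are not taken from the corpus or from any released loader, and whitespace tokenization is only a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitPair:
    """One aligned example: a complex source sentence and its simplified targets.

    Hypothetical container; the field names are assumptions, not the corpus schema.
    """
    complex_sentence: str
    simple_sentences: List[str]


def token_count(sentence: str) -> int:
    # Whitespace tokenization, used here only as a rough length measure.
    return len(sentence.split())


def avg_target_length(pair: SplitPair) -> float:
    # Mean token count over the simplified target sentences of one pair.
    total = sum(token_count(s) for s in pair.simple_sentences)
    return total / len(pair.simple_sentences)


# Invented example sentences, not drawn from the corpus.
example = SplitPair(
    complex_sentence=(
        "The bridge, which was built in 1932, "
        "connects the two districts of the city."
    ),
    simple_sentences=[
        "The bridge was built in 1932.",
        "The bridge connects the two districts of the city.",
    ],
)

print(token_count(example.complex_sentence))  # length of the source sentence
print(len(example.simple_sentences))          # number of target propositions
print(avg_target_length(example))             # mean length of the targets
```

In this toy pair, each target sentence states one fact and is shorter than the source, which is the property the abstract attributes to minimal propositions.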

Code & Models

Datasets

cl-nagoya/min-wikisplit
dataset · 7 downloads
