PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition

Sihao Chen, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, Tal Schuster

arXiv:2212.10750 · cs.CL · May 26, 2023

TL;DR

PropSegmEnt introduces a large annotated corpus for proposition-level segmentation and entailment recognition, enabling more granular analysis of entailment relations within sentences for improved natural language inference understanding.

Contribution

The paper presents a novel dataset with proposition-level annotations and establishes baseline models for segmentation and entailment tasks, advancing fine-grained NLI research.

Findings

01 Strong baseline performance on segmentation and entailment tasks

02 Potential for improved NLI explainability and compositionality analysis

03 Usefulness demonstrated in summary hallucination detection

Abstract

The widely studied task of Natural Language Inference (NLI) requires a system to recognize whether one piece of text is textually entailed by another, i.e. whether the entirety of its meaning can be inferred from the other. In current NLI datasets and models, textual entailment relations are typically defined on the sentence- or paragraph-level. However, even a simple sentence often contains multiple propositions, i.e. distinct units of meaning conveyed by the sentence. As these propositions can carry different truth values in the context of a given premise, we argue for the need to recognize the textual entailment relation of each proposition in a sentence individually. We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters. Our dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2)…
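The core idea in the abstract, that propositions within one sentence can carry different truth values against the same premise, can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the `Label` values, the `Proposition` class, the collapsing rule in `sentence_level_label`, and the example sentences are hypothetical and are not taken from the corpus or its actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # Assumed three-way label set; the abstract shown here does not list the labels.
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass
class Proposition:
    # Hypothetical representation: a proposition as plain text plus its label
    # with respect to a premise. The corpus may encode propositions differently.
    text: str
    label: Label


def sentence_level_label(props):
    """Collapse proposition labels into one sentence label (illustrative rule only)."""
    labels = {p.label for p in props}
    if Label.CONTRADICTION in labels:
        return Label.CONTRADICTION
    if Label.NEUTRAL in labels:
        return Label.NEUTRAL
    return Label.ENTAILMENT


# Made-up example: one hypothesis sentence split into two propositions.
premise = "Alice moved to Paris in 2019."
hypothesis = "Alice moved to Paris in 2020."
props = [
    Proposition("Alice moved to Paris", Label.ENTAILMENT),
    Proposition("The move happened in 2020", Label.CONTRADICTION),
]

# A single sentence-level label hides that one proposition is supported.
print(sentence_level_label(props).value)
for p in props:
    print(f"{p.label.value:13s} {p.text}")
```

The sentence-level view reports only one label for the whole hypothesis, while the per-proposition view shows which part of the sentence the premise supports and which part it does not.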

Code & Models

Datasets

sihaochen/propsegment
dataset · 190 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification