GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains

Yang Janet Liu; Tatsuya Aoyama; Wesley Scivetti; Yilun Zhu; Shabnam Behzad; Lauren Elizabeth Levine; Jessica Lin; Devika Tiwari; Amir Zeldes

arXiv:2411.00491 · cs.CL · November 4, 2024

TL;DR

This paper introduces GDTB, a new open-access, multi-genre benchmark dataset for English shallow discourse parsing based on the UD English GUM corpus, addressing limitations of the existing PDTB dataset.

Contribution

It provides the first multi-genre, openly available benchmark for PDTB-style discourse parsing, enabling broader research beyond the news domain.

Findings

1. Cross-domain relation classification shows substantial out-of-domain degradation.

2. Joint training on both GDTB and PDTB improves cross-domain performance.

Abstract

Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.

Code & Models

Repositories

gucorpling/gum2pdtb (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification