Why Can't Discourse Parsing Generalize? A Thorough Investigation of the Impact of Data Diversity

Yang Janet Liu; Amir Zeldes

arXiv:2302.06488·cs.CL·February 14, 2023

TL;DR

This paper investigates how data diversity affects the generalization ability of discourse parsers. It finds that models trained on diverse, multi-genre data perform more reliably on unseen text types, challenging the impression that discourse parsing for high-resource languages such as English is already reliable.

Contribution

It provides the first comprehensive evaluation of cross-corpus RST parsing generalizability, emphasizing the importance of genre diversity in training data for stable, out-of-domain performance.

Findings

1. Heterogeneous training data improves generalization across genres.
2. State-of-the-art models trained on the standard English newswire benchmark do not generalize well, even within the news domain.
3. Genre diversity in training data is critical for stable discourse parsing, across parser architectures.

Abstract

Recent advances in discourse parsing performance create the impression that, as in other NLP tasks, performance for high-resource languages such as English is finally becoming reliable. In this paper we demonstrate that this is not the case, and thoroughly investigate the impact of data diversity on RST parsing stability. We show that state-of-the-art architectures trained on the standard English newswire benchmark do not generalize well, even within the news domain. Using the two largest RST corpora of English with text from multiple genres, we quantify the impact of genre diversity in training data for achieving generalization to text types unseen during training. Our results show that a heterogeneous training regime is critical for stable and generalizable models, across parser architectures. We also provide error analyses of model outputs and out-of-domain performance. To our…
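One way to picture "generalization to text types unseen during training" is a leave-one-genre-out split, where each genre is held out for testing in turn while the parser is trained on the rest. The sketch below is only an illustration of that idea, not the authors' experimental setup: the function name `leave_one_genre_out`, the `(doc_id, genre)` input format, and the toy genre labels are all assumptions made for this example.

```python
from collections import defaultdict


def leave_one_genre_out(docs):
    """Yield (held_out_genre, train_ids, test_ids) once per genre.

    `docs` is a list of (doc_id, genre) pairs. This format is hypothetical
    and is not taken from the paper or its repository.
    """
    by_genre = defaultdict(list)
    for doc_id, genre in docs:
        by_genre[genre].append(doc_id)

    genres = sorted(by_genre)
    for held_out in genres:
        # Train on every genre except the held-out one.
        train = [d for g in genres if g != held_out for d in by_genre[g]]
        # Test only on the genre the model has never seen.
        test = list(by_genre[held_out])
        yield held_out, train, test


# Toy data: placeholder document IDs and genre labels, not real corpus contents.
docs = [("d1", "news"), ("d2", "news"), ("d3", "academic"), ("d4", "fiction")]
splits = list(leave_one_genre_out(docs))

for held_out, train, test in splits:
    print(f"held out: {held_out:8s} train={train} test={test}")
```

Comparing scores on each held-out genre against an in-genre baseline would then give a rough picture of how much a parser depends on having seen that text type.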

Code & Models

Repositories

janetlauyeung/crossGENRE4RST
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification