Can Transformer Models Measure Coherence In Text? Re-Thinking the Shuffle Test

Philippe Laban; Luke Dai; Lucas Bandarkar; Marti A. Hearst

arXiv:2107.03448·cs.CL·July 9, 2021

TL;DR

This paper critically examines the Shuffle Test for coherence measurement in NLP, demonstrating that models can achieve high accuracy through simple finetuning, and proposes a new, more challenging variant to better evaluate true coherence understanding.

Contribution

The paper advocates for zero-shot evaluation of coherence models and introduces the k-Block Shuffle Test to better assess models' genuine understanding of text coherence.

Findings

01 A finetuned RoBERTa model achieves 97.8% accuracy on the Shuffle Test.

02 Larger Transformer architectures achieve high performance out of the box in the zero-shot setting.

03 On the k-Block Shuffle Test, model performance drops while human accuracy remains high (around 95%), which makes it a more demanding benchmark.

Abstract

The Shuffle Test is the most common task to evaluate whether NLP models can measure coherence in text. Most recent work uses direct supervision on the task; we show that by simply finetuning a RoBERTa model, we can achieve a near-perfect accuracy of 97.8%, a state-of-the-art result. We argue that this outstanding performance is unlikely to lead to a good model of text coherence, and suggest that the Shuffle Test should be approached in a Zero-Shot setting: models should be evaluated without being trained on the task itself. We evaluate common models in this setting, such as Generative and Bi-directional Transformers, and find that larger architectures achieve high performance out of the box. Finally, we suggest the k-Block Shuffle Test, a modification of the original that increases the size of the shuffled blocks. Even though human reader performance remains high (around 95% accuracy), model…
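
The block-shuffling idea described in the abstract can be illustrated with a minimal Python sketch, assuming the document has already been split into a list of sentences. The function name `k_block_shuffle`, the `seed` parameter, and the handling of a shorter trailing block are illustrative assumptions, not taken from the paper or its repository. With `k=1` the sketch reduces to an ordinary sentence-level shuffle.

```python
import random


def k_block_shuffle(sentences, k=1, seed=None):
    """Return a copy of `sentences` with blocks of k consecutive sentences shuffled.

    Sentence order inside each block is preserved; only the order of the
    blocks changes. Assumption: if len(sentences) is not a multiple of k,
    the final block is simply shorter.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    rng = random.Random(seed)

    # Group consecutive sentences into blocks of size k.
    blocks = [sentences[i:i + k] for i in range(0, len(sentences), k)]

    # With fewer than two blocks there is nothing to reorder.
    if len(blocks) < 2:
        return list(sentences)

    # Reshuffle until the block order differs from the original order,
    # so the output is a genuinely shuffled document.
    order = list(range(len(blocks)))
    while order == sorted(order):
        rng.shuffle(order)

    # Flatten the reordered blocks back into a list of sentences.
    return [sentence for idx in order for sentence in blocks[idx]]


if __name__ == "__main__":
    doc = ["A.", "B.", "C.", "D.", "E.", "F."]
    print(k_block_shuffle(doc, k=1, seed=0))  # sentence-level shuffle
    print(k_block_shuffle(doc, k=2, seed=0))  # pairs of sentences stay together
```

Larger values of `k` keep more local sentence order intact, so a shuffled document differs from the original only at block boundaries.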

Code & Models

Repositories

tingofurro/shuffle_test (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Weight Decay · Linear Warmup With Linear Decay · Residual Connection · Softmax · Dense Connections