Generation Challenges: Results of the Accuracy Evaluation Shared Task

Craig Thomson; Ehud Reiter

arXiv:2108.05644·cs.CL·August 17, 2021

TL;DR

This paper presents the results of a shared task on evaluating the factual accuracy of neural natural language generation systems in sports reporting, highlighting the challenges and varying effectiveness of different evaluation methods.

Contribution

The paper introduces a shared task on techniques for evaluating the factual accuracy of neural NLG output, comparing manual and automatic approaches in the sports-reporting domain.

Findings

1. Automatic methods struggled with complex factual errors
2. Best submissions performed reasonably well but faced limitations
3. Manual evaluation remains crucial for complex accuracy assessment

Abstract

The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).

Code & Models

Repositories

ehudreiter/accuracysharedtask
Official