TL;DR
This paper reports the results of a shared task on evaluating the factual accuracy of texts produced by neural natural language generation (NLG) systems in a sports-reporting domain. The best submissions did encouragingly well, but all automatic methods struggled to detect semantically or pragmatically complex errors.
Contribution
The paper introduces a shared task on techniques, both manual and automatic, for evaluating the factual accuracy of neural NLG output in a sports-reporting domain.
Findings
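Automatic methods struggled to detect semantically or pragmatically complex factual errors, such as those based on incorrect computation or inference
The best-performing submissions did encouragingly well at a difficult task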
Manual evaluation remains crucial for complex accuracy assessment
Abstract
The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).
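To make the "incorrect computation" category concrete, here is a minimal sketch of why such errors are harder to catch than simple value mismatches. It is not taken from the paper or from any submission: the box score, the team names, the generated sentence, and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical box score; team names and numbers are invented for illustration.
box_score = {"Team A": 105, "Team B": 99}


def value_is_in_data(claimed_number, scores):
    """Lookup-style check: is the claimed number literally present in the data?"""
    return claimed_number in scores.values()


def margin_is_correct(claimed_margin, scores):
    """Computation-style check: recompute the winning margin and compare."""
    high, low = max(scores.values()), min(scores.values())
    return claimed_margin == high - low


# Hypothetical generated sentence: "Team A beat Team B by 8 points."
claimed_margin = 8

# The margin is never stored in the data, so a lookup cannot confirm or refute it.
lookup_result = value_is_in_data(claimed_margin, box_score)

# Only deriving the margin from the two scores shows whether the claim holds.
computed_result = margin_is_correct(claimed_margin, box_score)
```

A checker that only matches numbers in the text against values in the source data has nothing to compare the claimed margin against, whereas a checker that derives the margin can flag the sentence as wrong. Real generated texts involve many more derived quantities than this toy case.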
