The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
Anastasia Shimorina, Anya Belz

TL;DR
This paper presents the Human Evaluation Datasheet, a standardized template designed to systematically record details of human evaluation experiments in NLP, enhancing reproducibility and comparability.
Contribution
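It introduces a structured datasheet template, inspired by data statements (Bender and Friedman, 2018), model cards (Mitchell et al., 2019), and datasheets for datasets (Gebru et al., 2020), to standardize how human evaluations in NLP research are documented.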
Findings
Facilitates detailed recording of human evaluation experiments
Supports comparability and meta-evaluation in NLP studies
Aims to improve reproducibility of human evaluation results
Abstract
This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP). Originally taking inspiration from seminal papers by Bender and Friedman (2018), Mitchell et al. (2019), and Gebru et al. (2020), the Human Evaluation Datasheet is intended to facilitate the recording of properties of human evaluations in sufficient detail, and with sufficient standardisation, to support comparability, meta-evaluation, and reproducibility tests.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
