Assessing agreement on classification tasks: the kappa statistic
Jean Carletta (University of Edinburgh)

TL;DR
This paper critiques current reliability measures in discourse and dialogue analysis within computational linguistics and cognitive science, advocating for adopting the kappa statistic from content analysis to improve interpretability and comparability.
Contribution
It highlights the limitations of existing reliability measures and proposes adopting the kappa statistic to enhance assessment consistency in discourse analysis.
Findings
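The reliability statistics currently used in discourse and dialogue work are neither easily interpretable nor comparable to each other.
The kappa statistic, already established in content analysis, offers an interpretable and comparable measure of agreement.
Adopting techniques from content analysis would improve how reliability is assessed in discourse and dialogue analysis.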
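To make the interpretability point concrete, here is a small worked example of the kappa statistic in its usual form, where P(A) is the observed proportion of agreement between coders and P(E) is the proportion expected by chance. The numbers are illustrative assumptions, not results from the paper.

```latex
% Kappa: agreement corrected for chance.
K = \frac{P(A) - P(E)}{1 - P(E)}

% Case 1: two equally likely categories, so P(E) = 0.5^2 + 0.5^2 = 0.50.
% With observed agreement P(A) = 0.90:
K = \frac{0.90 - 0.50}{1 - 0.50} = 0.80

% Case 2: a skewed 90/10 category split, so P(E) = 0.9^2 + 0.1^2 = 0.82.
% With the same observed agreement P(A) = 0.90:
K = \frac{0.90 - 0.82}{1 - 0.82} \approx 0.44
```

The same 90% raw agreement corresponds to quite different kappa values once chance agreement is taken into account, which is why raw agreement figures from different coding schemes are hard to compare.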
Abstract
Currently, computational linguists and cognitive scientists working in the area of discourse and dialogue argue that their subjective judgments are reliable using several different statistics, none of which are easily interpretable or comparable to each other. Meanwhile, researchers in content analysis have already experienced the same difficulties and come up with a solution in the kappa statistic. We discuss what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and argue that we would be better off as a field adopting techniques from content analysis.
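As a rough illustration of how such a chance-corrected agreement score can be computed, the sketch below implements the two-coder form commonly known as Cohen's kappa, K = (P(A) - P(E)) / (1 - P(E)). This is an assumption for illustration: the variant discussed in the paper may estimate chance agreement differently (for example, from category proportions pooled across coders), and the function name and sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Two-coder kappa: observed agreement corrected for chance agreement.

    Assumption: chance agreement is estimated from each coder's own
    category distribution (Cohen's formulation).
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both coders must label the same non-empty item set")

    n = len(labels_a)

    # P(A): proportion of items on which the two coders agree.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P(E): probability that the coders agree by chance, given how often
    # each coder uses each category.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (p_a - p_e) / (1 - p_e)


# Hypothetical dialogue-act labels from two coders on ten utterances.
coder_1 = ["question"] * 5 + ["statement"] * 5
coder_2 = ["question"] * 4 + ["statement"] * 6

kappa = cohen_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

Here the coders agree on 9 of 10 items, and chance agreement works out to 0.5, so the raw 90% agreement is reported as a kappa of 0.8.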
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Language, Metaphor, and Cognition
