Evaluating Open-QA Evaluation

Cunxiang Wang; Sirui Cheng; Qipeng Guo; Yuanhao Yue; Bowen Ding; Zhikun Xu; Yidong Wang; Xiangkun Hu; Zheng Zhang; Yue Zhang

arXiv:2305.12421 · cs.CL · October 24, 2023 · 5 cites



TL;DR

This paper introduces a new task and dataset for evaluating the accuracy of AI-generated answers in open question answering, highlighting the limitations of current automatic evaluation methods and emphasizing human evaluation's reliability.

Contribution

The paper presents QA-Eval, a new evaluation task, and EVOUNA, the corresponding dataset, for assessing the accuracy of AI-generated answers in Open-QA, with the aim of improving automatic evaluation methods.

Findings

1. Human evaluation remains the most reliable method.
2. Current automatic methods show limitations in accuracy.
3. The new dataset facilitates development of better evaluators.

Abstract

This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs). Current automatic evaluation methods have shown limitations, indicating that human evaluation remains the most reliable approach. We introduce a new task, Evaluating QA Evaluation (QA-Eval), and the corresponding dataset, EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA. Our evaluation of these methods utilizes human-annotated results to measure their performance. Specifically, the work investigates methods that show high correlation with human evaluations, deeming them more reliable. We also discuss the pitfalls of current methods and ways to improve LLM-based evaluators. We believe this new QA-Eval task and corresponding dataset EVOUNA will facilitate the…
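The idea of scoring an automatic evaluator against human-annotated results can be illustrated with a minimal sketch. It assumes a simple lexical-match evaluator as the automatic method and treats human labels as ground truth; the function names (`normalize`, `lexical_match`, `agreement`) and the four toy records are hypothetical and are not taken from the paper or from EVOUNA.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def lexical_match(generated: str, gold_answers: list[str]) -> bool:
    """Assumed automatic evaluator: any gold answer appears in the output."""
    gen = normalize(generated)
    return any(normalize(g) in gen for g in gold_answers if normalize(g))


def agreement(predicted: list[bool], human: list[bool]) -> dict[str, float]:
    """Accuracy and F1 of evaluator verdicts, with human labels as truth."""
    pairs = list(zip(predicted, human))
    tp = sum(p and h for p, h in pairs)
    fp = sum(p and not h for p, h in pairs)
    fn = sum(not p and h for p, h in pairs)
    correct = sum(p == h for p, h in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy records invented for illustration: (gold answers, generated answer,
# human judgment of whether the generated answer is correct).
records = [
    (["Paris"], "The capital of France is Paris.", True),
    (["1969"], "Apollo 11 landed on the Moon in 1968.", False),
    (["Barack Obama"], "It was Obama.", True),
    (["Jupiter"], "It is not Jupiter; the answer is Saturn.", False),
]

predicted = [lexical_match(gen, gold) for gold, gen, _ in records]
human = [label for _, _, label in records]
scores = agreement(predicted, human)
print(scores)
```

The third and fourth records show two ways a lexical evaluator can disagree with a human: a correct answer phrased differently from the gold string, and a wrong answer that happens to mention it.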


Code & Models

Repositories

wangcunxiang/qa-eval · Official

Videos

Evaluating Open-QA Evaluation · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Expert finding and Q&A systems