Can we trust the evaluation on ChatGPT?

Rachith Aiyappa; Jisun An; Haewoon Kwak; Yong-Yeol Ahn

arXiv:2303.12767 · cs.CL · August 23, 2024 · 6 cites

TL;DR

This paper examines why ChatGPT's performance is hard to evaluate reliably across tasks: test data may have leaked into its training data, and the model is closed and continuously updated, which raises concerns about the validity of reported results.

Contribution

It highlights the issue of data contamination in ChatGPT evaluations through a case study on stance detection, and discusses the challenge of preventing contamination and ensuring fair evaluation of closed, continuously trained models.

Findings

01 Data contamination can bias evaluation results.

02 Evaluating ChatGPT requires checking whether test data may have appeared in its training data.

03 Fairly assessing closed, continuously updated models remains challenging.

Abstract

ChatGPT, the first large language model (LLM) with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. Despite its evident usefulness, evaluating ChatGPT's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via Reinforcement Learning from Human Feedback (RLHF). We highlight the issue of data contamination in ChatGPT evaluations, with a case study of the task of stance detection. We discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)