Can we trust the evaluation on ChatGPT?
Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn

TL;DR
This paper examines the challenges of reliably evaluating ChatGPT's performance across various tasks due to data contamination and the model's closed, continuously updated nature, raising concerns about assessment validity.
Contribution
It highlights the issue of data contamination in ChatGPT evaluations and discusses methods to ensure fair and accurate assessment of closed, evolving models.
Findings
Data contamination can bias evaluation results.
Evaluating ChatGPT requires checking whether test data may have entered its training data.
Closed, continuously updated models are difficult to assess fairly.
Abstract
ChatGPT, the first large language model (LLM) with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. Despite its evident usefulness, evaluating ChatGPT's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via Reinforcement Learning from Human Feedback (RLHF). We highlight the issue of data contamination in ChatGPT evaluations, with a case study of the task of stance detection. We discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.
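As a rough illustration of what a data contamination check can look like, the sketch below flags benchmark examples that share a long word n-gram with a reference corpus. This is not the paper's method: the function names, the choice of n, and the assumption that a searchable corpus is available are all hypothetical. For a closed model such as ChatGPT the training data cannot be inspected, which is the difficulty the abstract points to, so a check like this only applies to a proxy corpus.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_examples, corpus_docs, n=8):
    """Fraction of test examples that share at least one n-gram with the corpus.

    Assumes corpus_docs is a proxy for training data; the real training
    data of a closed model is not available for this kind of check.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not test_examples:
        return 0.0
    flagged = [ex for ex in test_examples if ngrams(ex, n) & corpus_grams]
    return len(flagged) / len(test_examples)


if __name__ == "__main__":
    tests = [
        "the quick brown fox jumps over the lazy dog",
        "completely different sentence here",
    ]
    corpus = ["I read that the quick brown fox jumps over the lazy dog today"]
    print(contamination_rate(tests, corpus, n=4))
```

A low overlap rate against a proxy corpus does not show that a benchmark is uncontaminated. It only shows that no verbatim overlap was found in the text that could be checked.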
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
