In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT
Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang

TL;DR
This study systematically evaluates ChatGPT's reliability across multiple domains and question types, revealing variability in performance, vulnerabilities to adversarial inputs, and the influence of system roles on answer accuracy.
Contribution
It provides the first large-scale measurement of ChatGPT's reliability, highlighting domain-specific weaknesses and the impact of system roles and adversarial attacks.
Findings
ChatGPT underperforms in law and science questions.
System roles can subtly influence reliability.
Adversarial examples can significantly reduce accuracy.
Abstract
The way users acquire information is undergoing a paradigm shift with the advent of ChatGPT. Unlike conventional search engines, ChatGPT retrieves knowledge from the model itself and generates answers for users. ChatGPT's impressive question-answering (QA) capability has attracted more than 100 million users within a short period of time but has also raised concerns regarding its reliability. In this paper, we perform the first large-scale measurement of ChatGPT's reliability in the generic QA scenario with a carefully curated set of 5,695 questions across ten datasets and eight domains. We find that ChatGPT's reliability varies across different domains, especially underperforming in law and science questions. We also demonstrate that system roles, originally designed by OpenAI to allow users to steer ChatGPT's behavior, can impact ChatGPT's reliability in an imperceptible way. We…
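As an illustration of what a per-domain reliability measurement involves, the sketch below aggregates graded answers into an accuracy score for each domain. It is a minimal sketch, not the authors' evaluation code: the function name `accuracy_by_domain`, the `(domain, is_correct)` record layout, and the toy records are all assumptions made for this example, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: fraction of correct answers}.

    `records` is an iterable of (domain, is_correct) pairs; this layout
    is a hypothetical choice for the sketch, not the paper's data format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in records:
        totals[domain] += 1
        if is_correct:
            correct[domain] += 1
    return {domain: correct[domain] / totals[domain] for domain in totals}


# Hypothetical toy records for illustration only (not results from the paper).
toy_records = [
    ("law", True), ("law", False),
    ("science", False), ("science", True), ("science", False), ("science", False),
    ("history", True), ("history", True),
]

scores = accuracy_by_domain(toy_records)
for domain, score in sorted(scores.items()):
    print(f"{domain}: {score:.2f}")
```

Comparing such per-domain scores side by side is one simple way to surface the kind of domain-level variation the abstract describes.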
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
