How is ChatGPT's behavior changing over time?
Lingjiao Chen, Matei Zaharia, James Zou

TL;DR
This study evaluates how GPT-3.5 and GPT-4's performance and behavior change over time across diverse tasks, revealing significant variations that highlight the importance of ongoing monitoring of large language models.
Contribution
The paper provides a comprehensive analysis of temporal behavior shifts in GPT-3.5 and GPT-4, emphasizing the need for continuous evaluation of LLMs.
Findings
GPT-4's accuracy on prime vs. composite number identification dropped from 84% (March 2023) to 51% (June 2023).
GPT-4 became less willing to answer sensitive questions between March and June 2023.
Both models made more formatting errors in code generation in June 2023 than in March 2023.
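To make the accuracy comparison in the first finding concrete, here is a minimal sketch of how one might score two model snapshots on a prime vs. composite task. Everything in it is an assumption for illustration: the `[Yes]`/`[No]` reply format, the helper names `parse_answer` and `accuracy`, and the toy numbers and replies, which are made up and are not the paper's data or code.

```python
from math import isqrt
from typing import Optional, Sequence


def is_prime(n: int) -> bool:
    """Ground truth by trial division up to sqrt(n)."""
    if n < 2:
        return False
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def parse_answer(reply: str) -> Optional[bool]:
    """Map a free-form reply to True (prime), False (composite), or None.

    Assumes the model was asked to end with "[Yes]" or "[No]";
    a reply with neither tag, or with both, is treated as unparseable.
    """
    text = reply.lower()
    has_yes = "[yes]" in text
    has_no = "[no]" in text
    if has_yes == has_no:
        return None
    return has_yes


def accuracy(numbers: Sequence[int], replies: Sequence[str]) -> float:
    """Fraction of replies that match ground truth; unparseable counts as wrong."""
    correct = sum(parse_answer(r) == is_prime(n) for n, r in zip(numbers, replies))
    return correct / len(numbers)


# Hypothetical toy data, NOT results from the paper.
numbers = [7, 9, 11, 15]
march_replies = ["7 has no divisors. [Yes]", "9 = 3 * 3. [No]", "[Yes]", "[Yes]"]
june_replies = ["[No]", "[No]", "[No]", "[No]"]

march_acc = accuracy(numbers, march_replies)
june_acc = accuracy(numbers, june_replies)
print(f"March: {march_acc:.2f}, June: {june_acc:.2f}")
```

Running the same fixed question set against each dated snapshot, and counting unparseable replies as errors, keeps the comparison sensitive to both wrong answers and format drift.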
Abstract
GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenability to following chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. GPT-4 became…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Cosine Annealing · Linear Layer · Label Smoothing
