A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models

Junjie Ye; Xuanting Chen; Nuo Xu; Can Zu; Zekai Shao; Shichun Liu; Yuhan Cui; Zeyang Zhou; Chao Gong; Yang Shen; Jie Zhou; Siming Chen; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2303.10420·cs.CL·December 27, 2023·186 cites

TL;DR

This paper provides a comprehensive analysis of GPT-3 and GPT-3.5 models' capabilities across multiple NLU tasks, revealing that their performance does not improve steadily over time and highlighting areas for future enhancement.

Contribution

It systematically evaluates six GPT models on nine NLU tasks, comparing zero-shot and few-shot performance, and uncovers insights into their evolution and robustness issues.
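The zero-shot versus few-shot distinction comes down to how the prompt is built. The following is a minimal Python sketch, assuming a simple "Input/Label" template. The `format_example` and `build_prompt` helpers and the sentiment example are hypothetical illustrations, not the prompts used in the paper. A zero-shot prompt contains only the instruction and the query. A k-shot prompt inserts k labeled demonstrations before the query.

```python
from typing import Sequence, Tuple


def format_example(text: str, label: str = "") -> str:
    # Hypothetical template (not the paper's): the label line is left
    # empty for the query so the model is expected to complete it.
    return f"Input: {text}\nLabel: {label}".rstrip()


def build_prompt(instruction: str, query: str,
                 demos: Sequence[Tuple[str, str]] = ()) -> str:
    """Zero-shot when `demos` is empty; k-shot when k (text, label) pairs are given."""
    parts = [instruction]
    parts += [format_example(text, label) for text, label in demos]
    parts.append(format_example(query))  # unlabeled query goes last
    return "\n\n".join(parts)


# Usage: the same query, once without and once with two demonstrations.
instruction = "Classify the sentiment of the input as positive or negative."
zero_shot = build_prompt(instruction, "A delightful film.")
two_shot = build_prompt(
    instruction,
    "A delightful film.",
    demos=[("Dull and slow.", "negative"), ("Great acting.", "positive")],
)
```

Keeping the instruction and the query format identical in both settings means the only variable is the number of demonstrations, which is what a zero-shot versus few-shot comparison is meant to isolate.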

Findings

01

Performance does not improve steadily with model evolution.

02

RLHF enhances human-like responses but reduces task-solving ability.

03

Model robustness still needs significant improvement.
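One common way to quantify robustness is to score a model on an original test set and on a perturbed copy of it, then report the difference. This Python sketch assumes label-preserving perturbations and uses accuracy with an absolute drop. The `accuracy` and `performance_drop` names and the toy labels are hypothetical, and the paper's own transformations and metrics may differ.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    # Fraction of predictions that exactly match the gold labels.
    if len(preds) != len(golds) or not golds:
        raise ValueError("predictions and gold labels must be non-empty and aligned")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def performance_drop(orig_preds: Sequence[str],
                     perturbed_preds: Sequence[str],
                     golds: Sequence[str]) -> float:
    """Absolute accuracy drop from original to perturbed inputs (larger = less robust).

    Assumes each perturbation keeps the gold label unchanged.
    """
    return accuracy(orig_preds, golds) - accuracy(perturbed_preds, golds)


# Toy usage with made-up predictions (not results from the paper).
golds = ["pos", "neg", "pos", "neg"]
orig_preds = ["pos", "neg", "pos", "pos"]       # 3/4 correct
perturbed_preds = ["neg", "neg", "pos", "pos"]  # 2/4 correct
print(performance_drop(orig_preds, perturbed_preds, golds))  # 0.25
```

An absolute drop is easy to read but depends on the starting accuracy. A relative drop, which divides by the original accuracy, is a common alternative when comparing models of very different baseline strength.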

Abstract

GPT series models, such as GPT-3, Codex, InstructGPT, and ChatGPT, have gained considerable attention due to their exceptional natural language processing capabilities. However, despite the abundance of research on the differences in capabilities between GPT series models and fine-tuned models, limited attention has been given to how the capabilities of GPT series models have evolved over time. To conduct a comprehensive analysis of the capabilities of GPT series models, we select six representative models, comprising two GPT-3 series models (i.e., davinci and text-davinci-001) and four GPT-3.5 series models (i.e., code-davinci-002, text-davinci-002, text-davinci-003, and gpt-3.5-turbo). We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets. In particular, we compare the performance and robustness of different models for each task under…

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Byte Pair Encoding · Residual Connection · Dropout · Cosine Annealing