GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang

TL;DR
GPT-Fathom provides a comprehensive, reproducible benchmark suite for evaluating large language models, offering insights into their evolution from GPT-3 to GPT-4 and highlighting factors influencing their capabilities.
Contribution
The paper introduces GPT-Fathom, an open-source evaluation framework that systematically compares multiple LLMs under consistent settings, addressing the inconsistent settings and prompts behind the scores on previous leaderboards.
Findings
OpenAI's models show progressive improvements from GPT-3 to GPT-4.
Adding code data enhances reasoning capabilities of LLMs.
Alignment techniques such as SFT and RLHF affect model capabilities and can incur an alignment tax.
Abstract
With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical…
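As a rough illustration of what evaluating "under aligned settings" means, the sketch below scores several models on identical few-shot prompts with one shared metric. It is not GPT-Fathom's code and does not use the OpenAI Evals API. The names `build_prompt`, `exact_match`, `evaluate`, and the two toy models are hypothetical stand-ins for real LLM calls.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any function mapping a prompt string to a completion.
Model = Callable[[str], str]
Example = Tuple[str, str]  # (question, reference answer)


def build_prompt(shots: List[Example], question: str) -> str:
    """Format the same few-shot prompt for every model."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(
    models: Dict[str, Model], shots: List[Example], test_set: List[Example]
) -> Dict[str, float]:
    """Score every model on identical prompts with the same metric."""
    scores = {}
    for name, model in models.items():
        correct = sum(
            exact_match(model(build_prompt(shots, q)), ref) for q, ref in test_set
        )
        scores[name] = correct / len(test_set)
    return scores


# Toy stand-ins for real LLM calls (illustration only).
def always_four(prompt: str) -> str:
    return "4"


def lookup(prompt: str) -> str:
    # Recover the final question from the prompt and answer from a fixed table.
    question = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    return {"2 + 2 = ?": "4", "3 + 3 = ?": "6"}.get(question, "")


shots = [("1 + 1 = ?", "2")]
test_set = [("2 + 2 = ?", "4"), ("3 + 3 = ?", "6")]
scores = evaluate({"always_four": always_four, "lookup": lookup}, shots, test_set)
print(scores)
```

Because the prompt template, the number of shots, and the metric are fixed in one place, any score difference comes from the models rather than from per-model prompt tuning.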
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Absolute Position Encodings · Dense Connections · Layer Normalization · Attention Dropout
