GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

Shen Zheng; Yuyu Zhang; Yijie Zhu; Chenguang Xi; Pengyang Gao; Xun Zhou; Kevin Chen-Chuan Chang

arXiv:2309.16583 · cs.CL · April 3, 2024 · 2 cites

TL;DR

GPT-Fathom provides a comprehensive, reproducible benchmark suite for evaluating large language models, offering insights into their evolution from GPT-3 to GPT-4 and highlighting factors influencing their capabilities.

Contribution

The paper introduces GPT-Fathom, an open-source evaluation framework that systematically compares multiple LLMs under consistent settings. This addresses a limitation of earlier leaderboards, which often cite scores obtained under inconsistent settings and prompts.

Findings

1. OpenAI's models show progressive improvements from GPT-3 to GPT-4.

2. Adding code data improves the reasoning capabilities of LLMs.

3. Alignment techniques such as SFT and RLHF affect model performance and can incur an alignment tax.

Abstract

With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical…

Code & Models

Repositories

gpt-fathom/gpt-fathom (Official)

Videos

An Actually Big Week in AI: AutoGen, The A-Phone, Mistral 7B, GPT-Fathom and Meta Hunts CharacterAI · YouTube

GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Absolute Position Encodings · Dense Connections · Layer Normalization · Attention Dropout