Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models

Ziqian Bi; Keyu Chen; Chiung-Yi Tseng; Danyang Zhang; Tianyang Wang; Hongying Luo; Lu Chen; Junming Huang; Jibin Guan; Junfeng Hao; Xinyuan Song; Junhao Song

arXiv:2508.12461·cs.CL·December 16, 2025

TL;DR

This paper evaluates OpenAI's GPT-OSS open-weight models against six contemporary open source models across ten benchmarks. It finds that the smaller gpt-oss-20B outperforms gpt-oss-120B on several tasks, and that scaling sparse architectures may not yield proportional performance gains.

Contribution

The paper provides a comprehensive empirical evaluation of the GPT-OSS models, showing that a larger parameter count does not always lead to better performance in open source large language models.

Findings

1. gpt-oss-20B outperforms gpt-oss-120B on several benchmarks
2. Both models show mid-tier performance in the open source landscape
3. Scaling sparse architectures may not proportionally improve performance

Abstract

In August 2025, OpenAI released the GPT-OSS models, its first open-weight large language models since GPT-2 in 2019, comprising two mixture-of-experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall…
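The abstract names McNemar's test as the statistical check but does not show how it is applied. The following is a minimal sketch of the exact (binomial) form of the test for two models scored on the same benchmark items. The function name `mcnemar_exact` and the toy correctness flags are illustrative assumptions, not the authors' code or data, and the paper may have used a different variant of the test (for example the chi-squared approximation).

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Hypothetical helper for illustration; not taken from the paper.
    Returns (b, c, p_value), where b counts items only model A got right
    and c counts items only model B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models agree on every item

    # Under the null hypothesis, each discordant pair favours either
    # model with probability 0.5, so min(b, c) ~ Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (made up): 20 items, A alone right on 8, B alone right on 1.
model_a = [True] * 8 + [False] * 1 + [True] * 11
model_b = [False] * 8 + [True] * 1 + [True] * 11
b, c, p = mcnemar_exact(model_a, model_b)
print(f"A-only={b}, B-only={c}, p={p:.4f}")
```

Because the test looks only at items where the two models disagree, it suits paired benchmark comparisons in which both models answer the same questions. Items that both models get right or both get wrong do not affect the p-value.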

Code & Models

Models

🤗 EpistemeAI/Codeforce-metatune-gpt20b (model, 23 dl)

Taxonomy

Topics: Scientific Computing and Data Management