GPT-OSS-20B: A Comprehensive Deployment-Centric Analysis of OpenAI's Open-Weight Mixture of Experts Model

Deepak Kumar; Divakar Yadav; Yash Patel

arXiv:2508.16700 · cs.AR · September 3, 2025

TL;DR

This paper evaluates GPT-OSS-20B, a Mixture-of-Experts model, on a single H100 GPU and finds that it delivers higher decode throughput, lower energy per generated token, and lower peak VRAM than the dense baselines Qwen3-32B and Yi-34B, at the cost of a higher TTFT.

Contribution

The paper provides a deployment-centric, single-GPU analysis of GPT-OSS-20B, reporting TTFT, decode throughput, end-to-end latency percentiles, peak VRAM, and energy use against the dense baselines Qwen3-32B and Yi-34B.

Findings

01

GPT-OSS-20B achieves about 31.8% higher decode throughput and 25.8% lower energy per 1000 generated tokens than Qwen3-32B at a 2048-token context with 64-token decode.

02

GPT-OSS-20B uses 31.7% less peak VRAM than Qwen3-32B in the same 2048/64 setting.

03

TTFT is higher for GPT-OSS-20B than for the dense baselines, which the authors attribute to MoE routing overhead.

Abstract

We present a single-GPU (H100, bf16) evaluation of GPT-OSS-20B (Mixture-of-Experts; 20.9B total, approx. 3.61B active) against dense baselines Qwen3-32B and Yi-34B across multiple dimensions. We measure true time-to-first-token (TTFT), full-decode throughput (TPOT), end-to-end latency percentiles, peak VRAM with past key values (PKV) held, and energy via a consistent nvidia-smi-based sampler. At a 2048-token context with 64-token decode, GPT-OSS-20B delivers higher decode throughput and tokens per Joule than dense baselines Qwen3-32B and Yi-34B, while substantially reducing peak VRAM and energy per 1000 generated tokens; its TTFT is higher due to MoE routing overhead. With only 17.3% of parameters active (3.61B of 20.9B), GPT-OSS-20B provides about 31.8% higher decode throughput and 25.8% lower energy per 1000 generated tokens than Qwen3-32B at 2048/64, while using 31.7% less peak VRAM.…
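The arithmetic behind the quantities named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' harness: the function names, the trapezoidal integration of power samples, and the choice to count decode throughput over the tokens produced after the first one are all assumptions made here. Only the 3.61B and 20.9B parameter counts come from the abstract.

```python
from dataclasses import dataclass
from typing import Sequence


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters active per token, e.g. 3.61B of 20.9B."""
    return active_params_b / total_params_b


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoid rule.

    Assumption: samples come from some periodic power reader (the paper uses an
    nvidia-smi-based sampler); the integration rule is chosen for this sketch.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


@dataclass
class RunMetrics:
    ttft_s: float               # time from request start to first generated token
    decode_tokens_per_s: float  # tokens after the first, per second of decode
    tokens_per_joule: float
    joules_per_1k_tokens: float


def summarize_run(t_start: float, t_first_token: float, t_end: float,
                  n_generated: int, energy_j: float) -> RunMetrics:
    """Derive per-run metrics from three timestamps, a token count, and energy."""
    if n_generated < 2 or t_end <= t_first_token or energy_j <= 0:
        raise ValueError("need at least 2 tokens, positive decode time and energy")
    ttft = t_first_token - t_start
    decode_tps = (n_generated - 1) / (t_end - t_first_token)
    return RunMetrics(
        ttft_s=ttft,
        decode_tokens_per_s=decode_tps,
        tokens_per_joule=n_generated / energy_j,
        joules_per_1k_tokens=1000.0 * energy_j / n_generated,
    )


# Figures from the abstract: 3.61B active of 20.9B total parameters.
print(f"active share: {100 * active_fraction(3.61, 20.9):.1f}%")
```

Separating TTFT from decode time in this way keeps the prefill cost out of the throughput figure, so a model can post a higher TTFT and a higher decode throughput at once, as the abstract reports for GPT-OSS-20B.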

Taxonomy

Topics: Scientific Computing and Data Management · Mobile Crowdsensing and Crowdsourcing · Software System Performance and Reliability