TL;DR
This paper independently reproduces OpenAI's gpt-oss-20b scores by reverse-engineering the model's tool usage and creating a native harness, achieving results close to the original published scores.
Contribution
It introduces a method for reproducing gpt-oss-20b scores without access to the original tools or agent harness: reverse-engineer the tools the model was trained with, then run it in a new harness that uses the model's native message format.
Findings
Achieved 60.4% on SWE Verified HIGH, close to the published 60.7%
Achieved 53.3% on SWE Verified MEDIUM, close to the published 53.2%
Achieved 91.7% on AIME25 with tools, close to the published 90.4%
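To make "close to" concrete, the gaps between reproduced and published scores can be computed directly. The short sketch below uses only the numbers listed above; the dictionary layout and the function name `gaps` are illustrative choices, not anything taken from the paper.

```python
# Scores in percent, copied from the findings above as (reproduced, published).
results = {
    "SWE Verified HIGH": (60.4, 60.7),
    "SWE Verified MEDIUM": (53.3, 53.2),
    "AIME25 with tools": (91.7, 90.4),
}


def gaps(results):
    """Return reproduced minus published, in percentage points, per benchmark."""
    return {
        name: round(reproduced - published, 1)
        for name, (reproduced, published) in results.items()
    }


score_gaps = gaps(results)
for name, gap in score_gaps.items():
    # A negative gap means the reproduction scored below the published number.
    print(f"{name}: {gap:+.1f} points")
```

On these figures the two SWE Verified runs land within half a point of the published scores, and the AIME25 run lands a little over one point above it.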
Abstract
No one has independently reproduced OpenAI's published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model's in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence -- a strong prior, not a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).
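The tool-discovery step described in the abstract can be pictured as a frequency count over sampled completions: prompt the model many times without any tool definitions, record which tools it tries to call, and see which names recur. The sketch below is not the authors' code. It assumes the raw completions have already been collected as strings and that a tool call carries a `to=<recipient>` marker; the regular expression, the function names, and the sample strings (including `tool_a` and `tool_b`) are all hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: a tool call is marked by "to=<recipient>" in the raw completion.
# Adjust the pattern to whatever the real transcripts look like.
RECIPIENT = re.compile(r"to=([A-Za-z_][\w.]*)")


def tally_tool_calls(completions):
    """Count, for each tool name, how many completions call it at least once."""
    counts = Counter()
    for text in completions:
        # set() so a tool called twice in one completion is counted once.
        counts.update(set(RECIPIENT.findall(text)))
    return counts


def call_rates(completions):
    """Fraction of completions that call each tool, most frequent first."""
    total = len(completions)
    if total == 0:
        return {}
    counts = tally_tool_calls(completions)
    return {tool: n / total for tool, n in counts.most_common()}


# Made-up completions, for illustration only.
samples = [
    'to=tool_a {"cmd": "ls"}',
    'to=tool_a {"cmd": "pwd"} ... to=tool_a {"cmd": "ls"}',
    'to=tool_b {"query": "x"}',
    "plain answer with no tool call",
]
rates = call_rates(samples)
print(rates)
```

A tool name that appears in a large share of independent samples is a candidate for the model's training distribution, while names that show up only once or twice are more likely noise. How large that share has to be is a judgment call that this sketch does not settle.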
