In harmony with gpt-oss

Borislav Mavrin

arXiv:2604.00362·cs.AI·April 2, 2026

TL;DR

This paper independently reproduces OpenAI's published gpt-oss-20b scores with tools by reverse-engineering the tools the model expects and building a native harmony harness. Its results land within 1.3 percentage points of the published scores on every reported benchmark.

Contribution

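The paper shows how to reproduce gpt-oss-20b's with-tools scores without access to OpenAI's original tools or agent harness. It recovers the model's in-distribution tools by prompting it without tool definitions, then runs the model through a new harness that encodes messages in the native harmony format instead of converting them through Chat Completions.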

Findings

01

Achieved 60.4% on SWE Verified HIGH, close to 60.7% published

02

Achieved 53.3% on SWE Verified MEDIUM, close to 53.2% published

03

Achieved 91.7% on AIME25 with tools, close to 90.4% published
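
To make "close to" concrete, the gaps between the reproduced and published scores listed above can be computed directly. The snippet below is only a worked arithmetic example. The numbers come from this page, while the dictionary layout and the `gaps` helper are illustrative names and not part of the paper.

```python
# Reproduced vs. published scores in percent, copied from the findings above.
# The structure and names here are illustrative, not taken from the paper.
scores = {
    "SWE Verified HIGH": (60.4, 60.7),
    "SWE Verified MEDIUM": (53.3, 53.2),
    "AIME25 with tools": (91.7, 90.4),
}


def gaps(table):
    """Return reproduced minus published, in percentage points (rounded to 0.1)."""
    return {name: round(repro - pub, 1) for name, (repro, pub) in table.items()}


for name, gap in gaps(scores).items():
    print(f"{name}: {gap:+.1f} pp")
```

The largest absolute gap is on AIME25 with tools, where the reproduction scores above the published number. Both SWE Verified gaps are well under one percentage point.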

Abstract

No one has independently reproduced OpenAI's published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model's in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence -- a strong prior, not a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).
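
The abstract does not spell out how the tool prior was measured. One simple way to picture the idea is to sample many completions with no tool definitions in the prompt and tally which tool names the model addresses. The sketch below assumes that a tool call shows up as a `to=<name>` recipient in the raw completion text. The regular expression, the helper names, and the sample strings (including the tool names `tool_a` and `tool_b.run`) are all hypothetical and are not taken from the paper or its repository.

```python
import re
from collections import Counter

# Assumption: a tool call appears as a "to=<name>" recipient in the raw text.
# This pattern is illustrative and is not taken from the paper's harness.
RECIPIENT = re.compile(r"to=([A-Za-z_][\w.]*)")


def tally_tools(completions):
    """Count how often each tool name appears as a call recipient."""
    counts = Counter()
    for text in completions:
        counts.update(RECIPIENT.findall(text))
    return counts


def tool_frequencies(completions):
    """Return the fraction of all observed calls that target each tool."""
    counts = tally_tools(completions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Made-up strings standing in for sampled completions; NOT real model output.
samples = [
    "<|channel|>commentary to=tool_a <|message|>{}<|call|>",
    "<|channel|>commentary to=tool_a <|message|>{}<|call|>",
    "<|channel|>commentary to=tool_b.run <|message|>{}<|call|>",
    "<|channel|>commentary to=tool_a <|message|>{}<|call|>",
    "<|channel|>final<|message|>No tool call in this sample.",
]

print(tool_frequencies(samples))
```

A tool name that dominates such a tally across many prompts would be a candidate for the model's training distribution. A real analysis would need far more samples and a proper statistical test, which this sketch does not attempt.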

Code & Models

Repositories

borislavmavrin/harmonyagent (GitHub): https://github.com/borislavmavrin/harmonyagent.git
