TL;DR
This paper assesses the worst-case risks of releasing the open-weight model gpt-oss by using malicious fine-tuning to maximize its capabilities in biology and cybersecurity, finding only a limited increase in frontier risk.
Contribution
Introduces malicious fine-tuning to estimate maximum capabilities and risks of open-weight LLMs, providing a framework for safer model release decisions.
Findings
MFT gpt-oss underperforms frontier closed-weight models such as OpenAI o3 in both risk domains
Compared to other open-weight models, gpt-oss may marginally increase biological capabilities
Results support cautious release of open-weight models
Abstract
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not…
Peer Reviews
Decision: ICLR 2026 Poster
GPT OSS is a major model release, and it is good that someone has done a deep analysis of the security implications of its release. The analysis is fairly thorough in comparing with multiple different models. Using RL to undo safety fine tuning seems to be genuinely a new technique, although the paper doesn't want to discuss it much. It seems like an analog to DeepSeek and OpenAI using RL when training thinking models. I strongly encourage open research on open-weight models like this. Thank yo
This paper presents MFT (“malicious fine-tuning”) as a new idea, encompassing anti-refusal training and domain-specific capability training, but both of these are already widely known techniques. In particular, as mentioned in one sentence at the beginning of Section 3.1, using supervised fine-tuning to undo safety training or remove guardrails is well established. More references beyond those cited: https://arxiv.org/abs/2310.20624 https://aclanthology.org/2024.naacl-short.59/ https://ar
The manuscript provides a valuable contribution by directly examining the worst-case capability ceiling of an open-weight model under a realistically resourced malicious fine-tuning scenario. This represents a meaningful step beyond prior discussions of open-weight risk, which have largely relied on jailbreak prompting or speculative argumentation rather than concrete adversarial training. A notable strength is the unified treatment of refusal-removal, domain-specific RL fine-tuning, and tool-b
One weakness is that the capability ceiling inferred for biological risk relies heavily on expert-level troubleshooting and tacit technique benchmarks. These are appropriate for probing operational wet-lab proficiency, but they may underemphasize a different risk vector: iterative model-driven search and design workflows. Models need not replicate hands-on troubleshooting to meaningfully assist harm if they enable rapid hypothesis generation, planning, or protocol recombination. The manuscript n
The paper uses state-of-the-art models, compares them thoroughly across several domains and benchmarks, explains its methodology clearly, and offers a realistic simulation of current malicious actors. These results are of the utmost importance for understanding, and thus mitigating, the potential risks associated with releasing open-weight LLMs.
I did not identify any significant weaknesses.
