Benchmarking Mythos-Linked Bug Rediscovery

Isaac David; Arthur Gervais

arXiv:2605.17416·cs.SE·May 19, 2026

TL;DR

This study evaluates whether language models can rediscover Mythos-linked bugs in a controlled target-file experiment on six systems tasks. Success is limited: the best-performing model achieves 5 target rediscoveries in 18 attempts, and the runs expose common failure modes.

Contribution

It provides a systematic benchmark for Mythos-linked bug rediscovery using language models, highlighting current limitations and failure modes.

Findings

01

GPT-5.5 xhigh achieved 5 target rediscoveries in 18 attempts, covering 2 of 6 tasks

02

Claude Opus 4.7 achieved 1 target rediscovery in 18 attempts

03

Kimi K2 did not rediscover any bugs

Abstract

Anthropic's April 2026 Mythos materials combine benchmark claims with concrete bug-finding stories across OpenBSD, FreeBSD, Linux, FFmpeg, and browsers. This paper reports a controlled target-file rediscovery experiment on six public or high-confidence Mythos-linked systems tasks. Each model receives the same target file or files, read-only source tools, three repeats per task, and one manual target-matching rubric; prompts omit CVE identifiers, patch hashes, advisory text, author names, disclosure dates, and answer key root cause language. The experiment contains 54 counted model-task attempts: three models, six tasks, and three repeats, giving 18 attempts per model. GPT-5.5 xhigh achieves 5/18 target rediscoveries, covering 2/6 tasks; counting one wrong-target mpegts.c finding separately gives 3/6 distinct core bugs. Claude Opus 4.7 achieves 1/18 target rediscoveries, covering 1/6…
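
The attempt counts in the abstract follow from a full grid of models, tasks, and repeats. A minimal sketch of that arithmetic is given here, assuming placeholder task names (the six tasks are not named on this page) and taking the Kimi K2 count of zero from the Findings list; the variable and function names are illustrative and not from the paper.

```python
from itertools import product

MODELS = ["GPT-5.5 xhigh", "Claude Opus 4.7", "Kimi K2"]
# Hypothetical placeholders: the page does not list the six task names.
TASKS = [f"task_{i}" for i in range(1, 7)]
REPEATS = 3

# Every (model, task, repeat) combination is one counted attempt.
attempts = list(product(MODELS, TASKS, range(REPEATS)))
attempts_per_model = len(TASKS) * REPEATS

# Target rediscoveries per model as stated on this page.
rediscoveries = {"GPT-5.5 xhigh": 5, "Claude Opus 4.7": 1, "Kimi K2": 0}


def success_rate(model: str) -> float:
    """Fraction of a model's attempts that matched the target bug."""
    return rediscoveries[model] / attempts_per_model


if __name__ == "__main__":
    print(f"total attempts: {len(attempts)}")
    print(f"attempts per model: {attempts_per_model}")
    for model in MODELS:
        hits = rediscoveries[model]
        print(f"{model}: {hits}/{attempts_per_model} = {success_rate(model):.1%}")
```

Attempt-level rates and task-level coverage are different measures: 5 successful attempts out of 18 can cover only 2 of 6 tasks when several repeats succeed on the same task.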
