Benchmarking Mythos-Linked Bug Rediscovery
Isaac David, Arthur Gervais

TL;DR
A controlled target-file experiment tests whether language models can rediscover Mythos-linked bugs in six systems tasks; success is limited, and the failures follow common patterns.
Contribution
It provides a systematic benchmark for Mythos-linked bug rediscovery using language models, highlighting current limitations and failure modes.
Findings
GPT-5.5 xhigh: 5 target rediscoveries in 18 attempts, covering 2 of 6 tasks
Claude Opus 4.7: 1 target rediscovery in 18 attempts, covering 1 of 6 tasks
Kimi K2: no target rediscoveries
Abstract
Anthropic's April 2026 Mythos materials combine benchmark claims with concrete bug-finding stories across OpenBSD, FreeBSD, Linux, FFmpeg, and browsers. This paper reports a controlled target-file rediscovery experiment on six public or high-confidence Mythos-linked systems tasks. Each model receives the same target file or files, read-only source tools, three repeats per task, and one manual target-matching rubric; prompts omit CVE identifiers, patch hashes, advisory text, author names, disclosure dates, and answer-key root-cause language. The experiment contains 54 counted model-task attempts: three models, six tasks, and three repeats, giving 18 attempts per model. GPT-5.5 xhigh achieves 5/18 target rediscoveries, covering 2/6 tasks; counting one wrong-target mpegts.c finding separately gives 3/6 distinct core bugs. Claude Opus 4.7 achieves 1/18 target rediscoveries, covering 1/6…
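The attempt arithmetic in the abstract (three models, six tasks, three repeats) and the two reported metrics (target rediscoveries per attempt, and tasks covered) can be sketched as a small tally. This is only an illustrative sketch, not the authors' harness: the task names are hypothetical placeholders, and because the abstract does not say how the five GPT-5.5 xhigh hits are split across its two tasks, the 3+2 split below is made up purely to reproduce the reported totals.

```python
from itertools import product

MODELS = ["GPT-5.5 xhigh", "Claude Opus 4.7", "Kimi K2"]
# Hypothetical placeholder names; the real task identifiers are not given here.
TASKS = [f"task_{i}" for i in range(1, 7)]
REPEATS = 3


def build_attempts(models, tasks, repeats):
    """Enumerate every (model, task, repeat) attempt in the full grid."""
    return list(product(models, tasks, range(1, repeats + 1)))


def summarize(hits, models, tasks, repeats):
    """Tally per-model results.

    `hits` is a set of (model, task, repeat) tuples that a manual rubric
    judged to be target rediscoveries.
    """
    summary = {}
    for model in models:
        model_hits = [h for h in hits if h[0] == model]
        summary[model] = {
            "attempts": len(tasks) * repeats,
            "rediscoveries": len(model_hits),
            "tasks_covered": len({task for _, task, _ in model_hits}),
        }
    return summary


attempts = build_attempts(MODELS, TASKS, REPEATS)

# Illustrative hit placement only (assumed 3+2 split over two tasks).
example_hits = {
    ("GPT-5.5 xhigh", "task_1", 1),
    ("GPT-5.5 xhigh", "task_1", 2),
    ("GPT-5.5 xhigh", "task_1", 3),
    ("GPT-5.5 xhigh", "task_2", 1),
    ("GPT-5.5 xhigh", "task_2", 2),
    ("Claude Opus 4.7", "task_1", 1),
}
summary = summarize(example_hits, MODELS, TASKS, REPEATS)

for model, row in summary.items():
    print(
        f"{model}: {row['rediscoveries']}/{row['attempts']} rediscoveries, "
        f"{row['tasks_covered']}/{len(TASKS)} tasks covered"
    )
```

Keeping the per-attempt rate and the task coverage as separate counts matters because they answer different questions: five hits concentrated in two tasks says more about repeatability on those tasks than about breadth across the benchmark.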
