Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Guijin Son; Seungone Kim; Catherine Arnett; Hyunwoo Ko; Hyein Lee; Hyeonah Kang; Jiang Longxi; Jin Yun; JungYup Lee; Kyungmin Lee; Sam Yoosuk Kim; Sang Park; Seunghyeok Hong; SeungJae Lee; Seungyeop Yi; Shinae Shin; SunHye Bok; Sunyoung Shin; Yonghoon Ji; Youngtaek Kim; Hanearl Jung; Akari Asai; Graham Neubig; Sean Welleck; Youngjae Yu; Akshelin R; Alexander B. Ivanov; Boboev Muhammadjon; Chae Young Han; Christian Stump; Cooper R. Anderson; Dmitrii Karp; Dohyun Kwon; Dongryung Yi; DoYong Kwon; Duk-Soon Oh; Eunho Choi; Giovanni Resta; Greta Panova; Huiyun Noh; Hyungryul Baik; Hyungsun Bae; Inomov Mashrafdzhon; Jeewon Kim; Jeong-Rae Kim; Ji Eun Lee; Jiaqi Liu; Jieui Kang; Jimin Kim; Jon-Lark Kim; Joonyeong Won; Junseo Yoon; Junwoo Jo; Kibeom Kim; Kiwoon Kwon; Mario Kummer; Max Mercer; Min Hoon Kim; Minjun Kim; Nahyun Lee; Ng Ze-An; Nicolas Libedinsky; Rafał Marcin Łochowski; Raphaël Lachièze-Rey; Robert Auffarth; Ruichen Zhang; Sejin Park; Seonguk Seo; Shin Jaehoon; Sunatullo; Taewoong Eom; Yeachan Park; Yongseok Jang; Youchan Oh; Zhaoyang Wang; Zoltán Kovács

arXiv:2605.09063·cs.CL·May 20, 2026

TL;DR

Soohak is a new, large research-level math benchmark with 439 problems, designed to evaluate the reasoning capabilities of large language models, including their ability to recognize ill-posed problems.

Contribution

The paper introduces a research-level math benchmark authored by mathematicians, including a refusal subset that assesses whether models can identify unsolvable problems.

Findings

01

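The best frontier model reported reaches 30.4% accuracy on the Challenge subset; the abstract lists 30.4%, 26.4%, and 10.4% for Gemini-3-Pro, GPT-5, and Claude-Opus-4.5.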

02

Leading open-weight models score below 15%.

03

Models struggle to recognize ill-posed problems: no model exceeds 50% on the refusal subset.

Abstract

Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4%…
