Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting

Goun Pyeon; Inbum Heo; Jeesu Jung; Taewook Hwang; Hyuk Namgoong; Hyein Seo; Yerim Han; Eunbin Kim; Hyeonseok Kang; Sangkeun Jung

arXiv:2511.18649 · cs.CL · December 2, 2025

TL;DR

This study evaluates the mathematical reasoning of 24 Large Language Models on the 2026 Korean CSAT Mathematics exam in a contamination-free setting, comparing model performance, input modality effects, and the impact of reasoning strategies.

Contribution

It introduces a fully contamination-free evaluation environment, a standardized digitization pipeline for exam data, and an integrated analysis of performance, cost, and efficiency in LLM assessment.

Findings

01. GPT-5 family models achieved perfect scores (100 points) under a limited set of language-modality configurations.

02. Text input outperformed image input across models.

03. Enhanced reasoning improves performance but reduces efficiency.

Abstract

This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (Text-only, Image-only, Text+Figure) and prompt languages (Korean, English). The GPT-5 family models achieved perfect scores (100 points) under a limited set of language-modality configurations, while Grok 4, Qwen 3 235B, and Gemini 2.5 Pro also scored above 97 points. Notably, gpt-oss-20B achieved 95.7 points despite…

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Text Readability and Simplification · Educational Assessment and Pedagogy