Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation

Sadegh Mahdavi; Muchen Li; Kaiwen Liu; Christos Thrampoulidis; Leonid Sigal; Renjie Liao

arXiv:2501.14275·cs.CL·June 30, 2025

TL;DR

This paper introduces AoPS-Instruct, a large dataset of Olympiad-level math QA pairs extracted from the AoPS forum, and LiveAoPSBench, a contamination-resistant benchmark, to improve and reliably evaluate the advanced math reasoning abilities of LLMs.

Contribution

The paper presents an automated pipeline for extracting high-quality math QA data from the AoPS forum and uses it to build a dynamic, contamination-resistant benchmark for LLM evaluation.

Findings

01. Fine-tuning on AoPS-Instruct improves LLM reasoning on math problems.

02. LLM performance on the benchmark declines over time, with lower accuracy on more recent problems.

03. Success on older problems may partly reflect pre-training exposure rather than reasoning ability.
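The time-dependent decline described in the findings can be inspected by grouping evaluation results by the date each problem was posted. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record layout `(post_date, is_correct)`, the function names, and the toy values are all hypothetical and are not measurements from the paper.

```python
from collections import defaultdict
from datetime import date


def accuracy_by_month(records):
    """Group (post_date, is_correct) pairs by calendar month.

    Returns a dict mapping (year, month) to the fraction of correct
    answers in that month, in chronological order.
    """
    totals = defaultdict(lambda: [0, 0])  # (year, month) -> [correct, total]
    for post_date, is_correct in records:
        key = (post_date.year, post_date.month)
        totals[key][0] += int(is_correct)
        totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in sorted(totals.items())}


def filter_after_cutoff(records, cutoff):
    """Keep only problems posted strictly after a cutoff date.

    The cutoff stands in for a model's training-data cutoff (an assumption),
    so the remaining problems are less likely to have been seen in training.
    """
    return [(d, ok) for d, ok in records if d > cutoff]


# Placeholder records for illustration only; these are NOT results from the paper.
toy_records = [
    (date(2024, 1, 5), True),
    (date(2024, 1, 20), True),
    (date(2024, 2, 3), True),
    (date(2024, 2, 17), False),
    (date(2024, 3, 9), False),
    (date(2024, 3, 30), False),
]

monthly = accuracy_by_month(toy_records)
recent_only = filter_after_cutoff(toy_records, date(2024, 1, 31))
```

Comparing `monthly` across months, or comparing accuracy on `recent_only` against accuracy on the full set, is one simple way to see whether scores on older problems are inflated relative to newer ones.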

Abstract

Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct…

Taxonomy

Topics: Open Education and E-Learning · Engineering Education and Curriculum Development · Higher Education Learning Practices

Methods: Sparse Evolutionary Training