OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?
Leo Li, Ye Luo, Tingyou Pan

TL;DR
This paper evaluates whether the Orion-1 model's strong reasoning in math is genuine or due to memorization, using IMO and CNT problem datasets, and finds no significant evidence of memorization.
Contribution
The study provides an empirical comparison of Orion-1's reasoning capabilities across different datasets to assess memorization versus genuine reasoning.
Findings
No significant evidence of memorization in Orion-1's responses
Comparable performance on IMO and CNT datasets
Case studies reveal features of the model's reasoning process
Abstract
The Orion-1 model by OpenAI is claimed to have more robust logical reasoning capabilities than previous large language models. However, some suggest that this excellence might be partially due to the model "memorizing" solutions, resulting in less satisfactory performance when prompted with problems not in the training data. We conduct a comparison experiment using two datasets: one consisting of International Mathematics Olympiad (IMO) problems, which are easily accessible; the other consisting of Chinese National Team Training camp (CNT) problems, which have similar difficulty but are not as publicly accessible. We label the response for each problem and compare the performance between the two datasets. We conclude that there is no significant evidence to show that the model relies on memorizing problems and solutions. Also, we perform case studies to analyze some features of the model's…
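The comparison described in the abstract (label each response, then compare performance across the two datasets) can be illustrated with a two-proportion z-test. This is only a sketch under stated assumptions: the abstract does not say which statistical test the authors used, the function name `two_proportion_z_test` is made up for illustration, and the counts below are hypothetical placeholders, not results from the paper.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Two-sided z-test for the difference between two success rates.

    Returns (z, p_value). A large p_value means the data give no
    significant evidence that the two rates differ.
    """
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # All responses share the same label, so no difference is detectable.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# HYPOTHETICAL counts for illustration only (not taken from the paper):
# number of responses labeled as acceptable out of all problems per dataset.
imo_ok, imo_total = 30, 60   # assumed placeholder values
cnt_ok, cnt_total = 28, 60   # assumed placeholder values

z, p = two_proportion_z_test(imo_ok, imo_total, cnt_ok, cnt_total)
print(f"z = {z:.3f}, p = {p:.3f}")
```

If memorization were driving performance, one would expect a markedly higher success rate on the more accessible IMO set and therefore a small p-value; a large p-value is consistent with the "no significant evidence" conclusion. With small samples, an exact test such as Fisher's may be a better choice than the normal approximation used here.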
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics Education and Teaching Techniques · Educational Assessment and Pedagogy
