OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?

Leo Li; Ye Luo; Tingyou Pan

arXiv:2411.06198·cs.AI·November 12, 2024·2 cites

Open Access

TL;DR

This paper evaluates whether the Orion-1 model's strong reasoning in math is genuine or due to memorization, using IMO and CNT problem datasets, and finds no significant evidence of memorization.

Contribution

The study provides an empirical comparison of Orion-1's reasoning capabilities across different datasets to assess memorization versus genuine reasoning.

Findings

01. No significant evidence of memorization in Orion-1's responses

02. Comparable performance on the IMO and CNT datasets

03. Case studies reveal features of the model's reasoning process

Abstract

The Orion-1 model by OpenAI is claimed to have more robust logical reasoning capabilities than previous large language models. However, some suggest that this excellence might be partially due to the model "memorizing" solutions, resulting in less satisfactory performance when it is prompted with problems not in the training data. We conduct a comparison experiment using two datasets: one consisting of International Mathematical Olympiad (IMO) problems, which are easily accessible; the other consisting of Chinese National Team Training camp (CNT) problems, which have similar difficulty but are not as publicly accessible. We label the response for each problem and compare the performance between the two datasets. We conclude that there is no significant evidence to show that the model relies on memorizing problems and solutions. Also, we perform case studies to analyze some features of the model's…
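The comparison described in the abstract (label each response, then compare performance across the two datasets) can be illustrated with a two-proportion z-test. This is a minimal sketch only: the excerpt does not state which statistical procedure the authors used, the function name `two_proportion_z_test` is hypothetical, the sketch assumes each response is labeled simply as correct or incorrect, and the counts are placeholders rather than the paper's data.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance).

    Returns (z, p_value). Assumes each response is labeled as a binary
    success/failure; this labeling scheme is an assumption of the sketch.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts for illustration only; NOT results from the paper.
z_same, p_same = two_proportion_z_test(success_a=30, n_a=60, success_b=30, n_b=60)
z_diff, p_diff = two_proportion_z_test(success_a=50, n_a=100, success_b=30, n_b=100)
print(f"equal rates:     z={z_same:.3f}, p={p_same:.4f}")
print(f"different rates: z={z_diff:.3f}, p={p_diff:.4f}")
```

Under this framing, a large p-value when comparing IMO against CNT accuracy would mean the data show no significant performance gap between the easily accessible and the less accessible problems, which is the shape of the conclusion the abstract reports. With small problem sets, an exact test (for example Fisher's exact test) may be a better choice than the normal approximation used here.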

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics Education and Teaching Techniques · Educational Assessment and Pedagogy