System 2 thinking in OpenAI's o1-preview model: Near-perfect performance on a mathematics exam

Joost de Winter; Dimitra Dodou; Yke Bauke Eisma

arXiv:2410.07114 · cs.CY · October 28, 2024

TL;DR

OpenAI's o1-preview model scored 76 and 74 out of 76 points on the Dutch 'Mathematics B' final exam, a level comparable to that of the top students. The result points to strong System 2-like reasoning, with some variability between runs.

Contribution

This study provides independent validation of OpenAI's o1-preview model's strong reasoning performance on a high-stakes mathematics exam, highlighting its potential and variability.

Findings

1. o1-preview scored near-perfect on the exam, outperforming most students.

2. Repeated testing confirmed results were not due to knowledge contamination.

3. Self-consistency prompts improve answer accuracy.
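The self-consistency finding can be illustrated with a minimal Python sketch. It assumes the common majority-vote form of self-consistency, in which the same question is asked several times and the most frequent final answer is kept. The summary above does not describe the authors' exact procedure, so the function name `majority_answer` and the sample answers are hypothetical.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent answer and its share of the votes.

    Assumption: answers are short final-answer strings that can be
    compared after trimming whitespace and lowercasing.
    """
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    # most_common(1) yields the (answer, vote count) pair with the highest count
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical final answers from four independent runs on one question
sampled = ["x = 2", "x = 2", "x = 3", " X = 2 "]
best, agreement = majority_answer(sampled)
```

The agreement share doubles as a rough confidence signal: a low share flags questions on which the runs disagree and the answer deserves a closer look.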

Abstract

The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI's benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch 'Mathematics B' final exam. It scored a near-perfect 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch students' average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model…
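To put the reported scores on a common scale, the short Python sketch below converts the point totals quoted in the abstract into percentages of the 76-point maximum. The figures come from the abstract; the labels and the helper name `as_percent` are illustrative only.

```python
MAX_POINTS = 76  # maximum score on the exam, as stated in the abstract

# Point totals quoted in the abstract
scores = {
    "o1-preview, run 1": 76,
    "o1-preview, run 2": 74,
    "GPT-4o, run 1": 66,
    "GPT-4o, run 2": 62,
    "student average": 40.63,
}


def as_percent(points, max_points=MAX_POINTS):
    """Express a point total as a percentage of the maximum, to one decimal."""
    return round(100 * points / max_points, 1)


percentages = {label: as_percent(points) for label, points in scores.items()}

# Share of students with a perfect score: 24 out of 16,414
perfect_share = round(100 * 24 / 16414, 3)

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
print(f"students with a perfect score: {perfect_share}%")
```

On this scale the two o1-preview runs reach 100.0% and 97.4%, the GPT-4o runs 86.8% and 81.6%, and the student average 53.5%, while about 0.146% of students obtained a perfect score.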

Taxonomy

Topics: Big Data and Business Intelligence