# Comparative evaluation of OpenAI O1 and human performance in higher order cognition

**Authors:** Ehsan Latif, Yifan Zhou, Shuchen Guo, Yizhu Gao, Lehong Shi, Matthew Nyaaba, Arne Bewerdorff, Xiantong Yang, Xiaoming Zhai

PMC · DOI: 10.1038/s41598-025-33629-9 · Scientific Reports · 2025-12-26

## TL;DR

This study compares OpenAI's o1-preview model to humans in higher-order thinking tasks and finds the AI performs well in structured assessments but has limitations in adaptive reasoning.

## Contribution

The study compares an AI model's performance with human performance across seven higher-order cognitive domains (critical, systematic, computational, and creative thinking; data literacy; logical and scientific reasoning) using established benchmarks.

## Key findings

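- o1-preview outperformed undergraduate and postgraduate participants in critical thinking (EWCTET mean 24.33 vs. 13.8 and 18.39) and exceeded the human mean in systematic thinking (46.1 vs. 20.08 on the Lake Urmia Vignette).
- The model scored higher than humans in data literacy (8.60 vs. 4.17 on the “Use Data” dimension) and scientific reasoning (0.99 vs. 0.85 on the TOSLS).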
- Despite high scores in structured tasks, the AI had limitations in problem-solving and adaptive reasoning.

## Abstract

This study evaluates the performance of OpenAI’s o1-preview model in higher-order cognitive domains, including critical thinking, systematic thinking, computational thinking, data literacy, creative thinking, logical reasoning, and scientific reasoning. Using established benchmarks, we compared the o1-preview model’s performance to that of human participants from diverse educational levels. o1-preview achieved a mean score of 24.33 on the Ennis-Weir Critical Thinking Essay Test (EWCTET), surpassing undergraduate (13.8) and postgraduate (18.39) participants (z = 1.60 and 0.90, respectively). In systematic thinking, it scored 46.1 ± 4.12 on the Lake Urmia Vignette, significantly outperforming the human mean (20.08 ± 8.13, z = 3.20). For data literacy, o1-preview scored 8.60 ± 0.70 on the test’s “Use Data” dimension, compared to the human post-test mean of 4.17 ± 2.02 (z = 2.19). On creative thinking tasks, the model achieved originality scores of 2.98 ± 0.73, higher than the human mean of 1.74 (z = 0.71). In logical reasoning (LogiQA), it outperformed humans with 90% ± 10% accuracy versus 86% ± 6.5% (z = 0.62). For scientific reasoning, it achieved near-perfect performance (0.99 ± 0.12) on the TOSLS, exceeding the highest human scores of 0.85 ± 0.13 (z = 1.78). While o1-preview excelled in structured tasks, it showed limitations in problem-solving and adaptive reasoning. These results demonstrate the potential of AI to complement education in structured assessments but highlight the need for ethical oversight and refinement for broader applications.
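The abstract reports z-scores without stating how they were computed. As a rough sketch, assuming z is the model mean minus the human mean, divided by the human standard deviation (an assumption, not a formula confirmed by this summary), the figures for three benchmarks can be reproduced from the numbers quoted above. The function name `z_score` and the `REPORTED` table are illustrative only.

```python
def z_score(model_mean: float, human_mean: float, human_sd: float) -> float:
    """Standardized difference of the model mean from the human mean.

    Assumed formula: (model_mean - human_mean) / human_sd.
    The summary does not confirm that the paper uses exactly this definition.
    """
    if human_sd <= 0:
        raise ValueError("human_sd must be positive")
    return (model_mean - human_mean) / human_sd


# (benchmark, model mean, human mean, human SD, z reported in the abstract)
REPORTED = [
    ("Lake Urmia Vignette", 46.1, 20.08, 8.13, 3.20),
    ("Data literacy, Use Data", 8.60, 4.17, 2.02, 2.19),
    ("LogiQA accuracy (%)", 90.0, 86.0, 6.5, 0.62),
]


def main() -> None:
    # Print the computed value next to the reported one for comparison.
    for name, model, human, sd, reported in REPORTED:
        z = z_score(model, human, sd)
        print(f"{name}: computed z = {z:.2f}, reported z = {reported:.2f}")


if __name__ == "__main__":
    main()
```

Under this assumption the computed values agree with the reported ones for these three benchmarks. Applying the same formula to the TOSLS figures (0.99 vs. 0.85 ± 0.13) gives roughly 1.08 rather than the reported 1.78, so the paper may use different inputs or a different definition there. The EWCTET and creative-thinking comparisons cannot be checked this way because the abstract gives no human standard deviation for them.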

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12847924/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12847924/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12847924/full.md

---
Source: https://tomesphere.com/paper/PMC12847924