Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Chongyang Gao; Diji Yang; Shuyan Zhou; Xichen Yan; Luchuan Song; Shuo Li; Kezhen Chen

arXiv:2602.19517·cs.AI·March 4, 2026

TL;DR

CFE-Bench is a challenging multimodal reasoning benchmark built from authentic university homework and exam problems across more than 20 STEM domains. It shows that current large language models still struggle with multi-step reasoning, in particular with maintaining correct intermediate states.

Contribution

The paper introduces CFE-Bench, a new authentic, multimodal reasoning benchmark from university exams, and provides diagnostic insights into the reasoning limitations of frontier language models.

Findings

01 The two best frontier models reach 59.69% (Gemini-3.1-pro-preview) and 55.46% (Gemini-3-flash-preview) overall accuracy on CFE-Bench.

02 Models often generate more reasoning steps than instructors, which increases the risk of error accumulation.

03 Models struggle to reliably derive and maintain correct intermediate reasoning states.

Abstract

We introduce CFE-Bench (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE-Bench is curated from repeatedly used, authentic university homework and exam problems, paired with reference solutions provided by course instructors. CFE-Bench remains challenging for frontier models: the newly released Gemini-3.1-pro-preview achieves 59.69% overall accuracy, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving substantial room for improvement. Beyond aggregate scores, we conduct a diagnostic analysis by decomposing instructor reference solutions into structured reasoning flows. We find that while frontier models often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We…
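The gap the abstract describes, between answering sub-questions correctly and keeping intermediate states correct across a whole solution, can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `StepResult` record, its field names, and the independence assumption in `chain_success_probability` are for illustration only and are not the paper's schema or evaluation procedure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepResult:
    """Outcome of one reasoning step (hypothetical record, not the paper's schema)."""
    correct_in_isolation: bool  # right when given the reference preceding state
    correct_in_chain: bool      # right when the model carries its own preceding state


def step_accuracy(steps: List[StepResult]) -> float:
    """Fraction of steps answered correctly when each is posed in isolation."""
    return sum(s.correct_in_isolation for s in steps) / len(steps)


def first_chain_failure(steps: List[StepResult]) -> Optional[int]:
    """Index of the first step whose carried state is wrong, or None if all hold."""
    for i, s in enumerate(steps):
        if not s.correct_in_chain:
            return i
    return None


def chain_success_probability(p: float, n: int) -> float:
    """Chance that n steps are all correct, ASSUMING each step is
    independently correct with probability p (a simplification)."""
    return p ** n


# A made-up four-step solution: every sub-question is answerable in
# isolation, but the carried state goes wrong at step index 2.
steps = [
    StepResult(True, True),
    StepResult(True, True),
    StepResult(True, False),
    StepResult(True, False),
]
print(step_accuracy(steps))        # 1.0
print(first_chain_failure(steps))  # 2

# Under the independence assumption, longer chains succeed less often.
print(chain_success_probability(0.95, 10))
print(chain_success_probability(0.95, 15))
```

Under this toy model, a solution that uses more steps than the instructor's has a lower chance of finishing with every intermediate state intact, even when per-step accuracy is unchanged.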

Code & Models

Datasets

analogyai/CFE_Benchmark
dataset · 121 dl

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Multimodal Machine Learning Applications