Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems

Stephen Miner; Yoshiki Takashima; Simeng Han; Sam Kouteili; Ferhat Erata; Ruzica Piskac; Scott J Shapiro

arXiv:2410.00151 · cs.CL · February 26, 2025

TL;DR

Scheherazade introduces an automated method to generate complex mathematical reasoning benchmarks by chaining problems, revealing nuanced differences in LLM reasoning abilities that are not captured by existing benchmarks.

Contribution

The paper presents Scheherazade, a scalable, automated approach to create challenging reasoning benchmarks through problem chaining, enhancing evaluation of LLMs.

Findings

01

OpenAI's o1-preview maintains performance on chained problems.

02

Performance drops sharply for other models with chaining.

03

Backward reasoning challenges models more than forward chaining.

Abstract

Benchmarks are critical for measuring Large Language Model (LLM) reasoning capabilities. Some benchmarks have even become the de facto indicator of such capabilities. However, as LLM reasoning capabilities improve, existing widely used benchmarks such as GSM8K only marginally capture differences in model reasoning: most state-of-the-art models, for example, achieve over 94% accuracy on the GSM8K dataset (paperwithcode, 2024). While constructing harder benchmarks is possible, their creation is often manual, expensive, and unscalable. As such, we present Scheherazade, an automated approach to produce large quantities of challenging mathematical reasoning benchmarks by logically chaining a small starting set of problems. We propose two different chaining methods, forward chaining and backward chaining, which include randomized branching techniques to generate complex reasoning problems. We…
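
The abstract does not spell out how problems are chained, so the following is only a minimal sketch of one way forward chaining with randomized branching could work, not the authors' implementation. It assumes each problem carries a hypothetical `distractor` premise, and that each link gates the next problem's real premise behind a condition on the previous answer. The `Problem` class, its fields, and `forward_chain` are all illustrative names.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    premise: str      # facts needed to solve the problem
    distractor: str   # hypothetical alternative premise (assumption, not from the paper)
    question: str
    answer: int       # answer under the real premise


def forward_chain(problems, rng):
    """Join problems so each one can only be read correctly after solving the previous one."""
    first = problems[0]
    parts = [f"{first.premise} {first.question}"]
    for prev, nxt in zip(problems, problems[1:]):
        # Randomized branching: the stated value is either the true previous
        # answer or an off-by-some decoy, chosen at random.
        condition_holds = rng.random() < 0.5
        stated = prev.answer if condition_holds else prev.answer + rng.randint(1, 9)
        # The real premise goes in whichever branch the true answer selects.
        if condition_holds:
            then_branch, else_branch = nxt.premise, nxt.distractor
        else:
            then_branch, else_branch = nxt.distractor, nxt.premise
        parts.append(
            f"If the answer to the previous question is {stated}, then: {then_branch} "
            f"Otherwise: {else_branch} {nxt.question}"
        )
    # The chained problem is graded on the final question only.
    return " ".join(parts), problems[-1].answer
```

Under these assumptions, a solver that gets any earlier answer wrong picks the wrong branch and carries the error forward, which is the property a chained benchmark relies on. Backward chaining would need a different construction and is not sketched here.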

Code & Models

Repositories

yoshikitakashima/scheherazade-code-data (official)

Taxonomy

Topics: Machine Learning and Data Classification · Online Learning and Analytics