Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
Stephen Miner, Yoshiki Takashima, Simeng Han, Sam Kouteili, Ferhat Erata, Ruzica Piskac, Scott J Shapiro

TL;DR
Scheherazade introduces an automated method to generate complex mathematical reasoning benchmarks by chaining problems, revealing nuanced differences in LLM reasoning abilities that are not captured by existing benchmarks.
Contribution
The paper presents Scheherazade, a scalable, automated approach to create challenging reasoning benchmarks through problem chaining, enhancing evaluation of LLMs.
Findings
OpenAI's o1-preview maintains performance on chained problems.
Performance drops sharply for other models with chaining.
Backward chaining challenges models more than forward chaining.
Abstract
Benchmarks are critical for measuring Large Language Model (LLM) reasoning capabilities. Some benchmarks have even become the de facto indicator of such capabilities. However, as LLM reasoning capabilities improve, widely used benchmarks such as GSM8K barely distinguish between models' reasoning abilities: most state-of-the-art models, for example, achieve over 94% accuracy on the GSM8K dataset (paperwithcode, 2024). While constructing harder benchmarks is possible, their creation is often manual, expensive, and unscalable. As such, we present Scheherazade, an automated approach to produce large quantities of challenging mathematical reasoning benchmarks by logically chaining a small starting set of problems. We propose two different chaining methods, forward chaining and backward chaining, which include randomized branching techniques to generate complex reasoning problems. We…
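The abstract does not spell out how problems are chained, so the following is only a minimal sketch of one way forward chaining with randomized branching could look. Every name in it (Problem, forward_chain, decoys) and the "If ... Otherwise ..." prompt template are assumptions made for illustration, not the authors' implementation. Each step states a condition on the previous answer; a coin flip decides whether that condition is true, and the real next problem is placed in whichever branch the solver should end up taking.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    """A seed problem with a known integer answer (hypothetical structure)."""
    question: str
    answer: int


def forward_chain(chain, decoys, rng):
    """Build one chained prompt from seed problems.

    chain  -- problems the solver must actually work through, in order
    decoys -- one distractor per link, placed in the branch not taken
    rng    -- random.Random instance controlling the branching
    Returns (prompt_text, expected_final_answer).
    """
    if len(decoys) != len(chain) - 1:
        raise ValueError("need exactly one decoy per link in the chain")

    lines = [f"Step 1: {chain[0].question}"]
    for i in range(1, len(chain)):
        prev, real, decoy = chain[i - 1], chain[i], decoys[i - 1]

        # Randomized branching: decide whether the stated condition holds.
        condition_true = rng.random() < 0.5
        # A false condition quotes a value that differs from the true answer.
        stated = prev.answer if condition_true else prev.answer + rng.randint(1, 9)
        # The real problem goes in the branch that will actually be taken.
        then_q, else_q = (real, decoy) if condition_true else (decoy, real)

        lines.append(
            f"Step {i + 1}: If the answer to step {i} is {stated}, "
            f"solve: {then_q.question} Otherwise, solve: {else_q.question}"
        )

    lines.append("Report the answer to the final step.")
    return "\n".join(lines), chain[-1].answer


if __name__ == "__main__":
    seeds = [
        Problem("What is 2 + 3?", 5),
        Problem("What is 4 * 6?", 24),
        Problem("What is 10 - 7?", 3),
    ]
    distractors = [Problem("What is 9 + 9?", 18), Problem("What is 8 - 2?", 6)]
    prompt, expected = forward_chain(seeds, distractors, random.Random(0))
    print(prompt)
    print("expected:", expected)
```

In this sketch the solver can only pick the correct branch at each step after solving the previous one, which is the sense in which the problems are chained. Changing the seed passed to random.Random yields a different prompt from the same small starting set.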
Taxonomy
Topics: Machine Learning and Data Classification · Online Learning and Analytics
