A NotSo Simple Way to Beat Simple Bench
Soham Sane, Angus McLean

TL;DR
This paper introduces a multi-step, feedback-driven reasoning framework for large language models that improves accuracy and robustness on complex reasoning benchmarks by leveraging iterative processes and global consistency checks.
Contribution
It proposes a novel multi-step prompting strategy with feedback mechanisms to enhance reasoning in LLMs, addressing limitations in existing benchmarks and evaluation metrics.
Findings
Iterative reasoning improves model accuracy and robustness.
Claude excels in logical consistency, while GPT-4o shows creativity.
Structured reasoning frameworks can address model limitations.
Abstract
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs) by leveraging iterative reasoning and feedback-driven methodologies. Building on the limitations identified in the SimpleBench benchmark, a dataset designed to evaluate logical coherence and real-world reasoning, we propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness. Through comparative analysis of state-of-the-art models, including Claude 3 Opus, Claude 3.5, GPT-4o, and o1-preview, we demonstrate that iterative reasoning significantly enhances model performance, with improvements observed in both standard accuracy metrics (AVG@5) and a newly introduced metric, Extreme Averaging (EAG@5). Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits…
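The abstract describes a draft-then-revise loop guarded by a consistency check, scored with AVG@5. A minimal sketch of that idea follows. It is not the authors' implementation: the names `iterative_answer`, `generate`, `check`, and `avg_at_k` are hypothetical, the model is represented by plain callables, and AVG@k is assumed here to mean accuracy averaged over k independent runs. EAG@5 is left out because its definition is not given in the text above.

```python
from typing import Callable, List, Optional

# Hypothetical signatures: generate(question, feedback) -> answer,
# check(question, answer) -> feedback string, or None if consistent.
Generate = Callable[[str, Optional[str]], str]
Check = Callable[[str, str], Optional[str]]


def iterative_answer(question: str, generate: Generate, check: Check,
                     max_rounds: int = 3) -> str:
    """Draft an answer, then revise it until the consistency check passes
    or the round budget is exhausted."""
    answer = generate(question, None)
    for _ in range(max_rounds):
        feedback = check(question, answer)
        if feedback is None:  # no inconsistency found, stop revising
            break
        answer = generate(question, feedback)
    return answer


def avg_at_k(runs: List[List[bool]]) -> float:
    """Mean accuracy over k runs (assumed reading of AVG@k).
    runs[i][j] is True if run i answered question j correctly."""
    per_run = [sum(r) / len(r) for r in runs]
    return sum(per_run) / len(per_run)


# Stub "model" for illustration only: it answers "A" first and "B" after feedback.
def stub_generate(question: str, feedback: Optional[str]) -> str:
    return "B" if feedback else "A"


def stub_check(question: str, answer: str) -> Optional[str]:
    return "answer conflicts with the scenario" if answer == "A" else None


final = iterative_answer("Which option?", stub_generate, stub_check)
score = avg_at_k([[True, False], [True, True]])
```

In a real setting the two callables would wrap calls to an LLM, and the check would compare the draft against the full question rather than a fixed rule.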
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
