A NotSo Simple Way to Beat Simple Bench

Soham Sane; Angus McLean

arXiv:2412.12173·cs.CL·December 18, 2024

Open Access

TL;DR

This paper introduces a multi-step, feedback-driven reasoning framework for large language models that improves accuracy and robustness on complex reasoning benchmarks by leveraging iterative processes and global consistency checks.

Contribution

The paper proposes a multi-step prompting strategy with feedback mechanisms to improve reasoning in LLMs, addressing limitations in existing benchmarks and evaluation metrics.

Findings

01

Iterative reasoning improves model accuracy and robustness.

02

Claude excels in logical consistency, while GPT-4o shows creativity.

03

Structured reasoning frameworks can address model limitations.

Abstract

This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs) by leveraging iterative reasoning and feedback-driven methodologies. Building on the limitations identified in the SimpleBench benchmark, a dataset designed to evaluate logical coherence and real-world reasoning, we propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness. Through comparative analysis of state-of-the-art models, including Claude 3 Opus, Claude 3.5, GPT-4o, and o1-preview, we demonstrate that iterative reasoning significantly enhances model performance, with improvements observed in both standard accuracy metrics (AVG@5) and a newly introduced metric, Extreme Averaging (EAG@5). Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits…
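The abstract does not spell out the prompting procedure, so the following is only a minimal sketch of what a feedback-driven loop with a consistency check could look like. The names `iterative_answer`, `model`, `check`, and `max_rounds` are hypothetical and the prompts are illustrative; none of this is taken from the paper. The `avg_at_k` helper assumes AVG@5 means accuracy averaged over five independent runs. EAG@5 is not sketched because its definition is not given here.

```python
from typing import Callable, List


def iterative_answer(
    question: str,
    model: Callable[[str], str],        # hypothetical: maps a prompt to a reply
    check: Callable[[str, str], bool],  # hypothetical consistency check
    max_rounds: int = 3,
) -> str:
    """Draft an answer, then revise it while the consistency check fails."""
    answer = model(f"Question: {question}\nThink step by step, then answer.")
    for _ in range(max_rounds):
        if check(question, answer):
            break  # the answer passed the check, stop revising
        # Feed the failed answer back to the model and ask for a revision.
        answer = model(
            f"Question: {question}\nPrevious answer: {answer}\n"
            "The previous answer failed a consistency check. Revise it."
        )
    return answer


def avg_at_k(run_results: List[List[bool]]) -> float:
    """Mean accuracy over k runs (assumed reading of AVG@5 when k = 5)."""
    per_run = [sum(run) / len(run) for run in run_results]
    return sum(per_run) / len(per_run)
```

In practice `model` would wrap a call to an LLM API and `check` might itself be a second model prompt; both are left as plain callables so the loop can be exercised with stubs.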

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education