MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett

TL;DR
MuSR is a new dataset designed to evaluate large language models' multistep reasoning abilities in complex, realistic narratives, revealing current limitations of chain-of-thought prompting.
Contribution
The paper introduces MuSR, a dataset for testing multistep reasoning in LLMs, built with a novel neurosymbolic synthetic-to-natural generation algorithm. The dataset is scalable and more realistic than existing benchmarks.
Findings
GPT-4 struggles with complex reasoning tasks in MuSR
Chain-of-thought prompting shows gaps in robustness
MuSR challenges current LLM reasoning capabilities
Abstract
While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free…
Code & Models
- 🤗 akjindal53244/Llama-3.1-Storm-8B · model · 2.2k dl · ♡ 177
- 🤗 akjindal53244/Llama-3.1-Storm-8B-FP8-Dynamic · model · 9 dl · ♡ 14
- 🤗 akjindal53244/Llama-3.1-Storm-8B-GGUF · model · 237 dl · ♡ 41
- 🤗 RichardErkhov/akjindal53244_-_Llama-3.1-Storm-8B-gguf · model · 102 dl · ♡ 2
- 🤗 QuantFactory/Llama-3.1-Storm-8B-GGUF · model · 37 dl · ♡ 2
- 🤗 unsloth/Llama-3.1-Storm-8B · model · 14 dl · ♡ 3
- 🤗 unsloth/Llama-3.1-Storm-8B-bnb-4bit · model · 17 dl · ♡ 7
- 🤗 EpistemeAI2/FireStorm-Llama-3.1-8B · model · 6 dl · ♡ 2
- 🤗 QuantFactory/FireStorm-Llama-3.1-8B-GGUF · model · 20 dl · ♡ 2
- 🤗 RichardErkhov/unsloth_-_Llama-3.1-Storm-8B-gguf · model · 186 dl
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Adam · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection
