MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

Zayne Sprague; Xi Ye; Kaj Bostrom; Swarat Chaudhuri; Greg Durrett

arXiv:2310.16049 · cs.CL · March 26, 2024 · 5 cites

TL;DR

MuSR is a new dataset designed to evaluate large language models' multistep reasoning abilities in complex, realistic narratives, revealing current limitations of chain-of-thought prompting.

Contribution

The paper introduces MuSR, a dataset for testing multistep reasoning in LLMs. It is built with a novel neurosymbolic synthetic-to-natural generation algorithm, which makes it scalable and more realistic than existing benchmarks.

Findings

01 GPT-4 struggles with complex reasoning tasks in MuSR

02 Chain-of-thought prompting shows gaps in robustness

03 MuSR challenges current LLM reasoning capabilities

Abstract

While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free…

Code & Models

Videos

MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning · SlidesLive

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Adam · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection