Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

Nisarg Patel; Mohith Kulkarni; Mihir Parmar; Aashna Budhiraja; Mutsumi Nakamura; Neeraj Varshney; Chitta Baral

arXiv:2406.17169 · cs.CL · October 8, 2024

TL;DR

Multi-LogiEval introduces a comprehensive dataset for evaluating large language models' multi-step logical reasoning across various logic types and inference rules, revealing significant performance drops with increased reasoning depth.

Contribution

The paper presents Multi-LogiEval, a new dataset for multi-step logical reasoning evaluation, including non-monotonic logic, and provides extensive analysis of LLMs' reasoning capabilities.

Findings

01. Average LLM accuracy drops from ~68% at depth 1 to ~43% at depth 5.

02. Performance varies significantly across logic types and inference rules.

03. Zero-shot chain-of-thought prompting reveals limitations in current LLM reasoning abilities.

Abstract

As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types--propositional, first-order, and non-monotonic--consisting of more than 30 inference rules and more than 60 of their combinations with various depths.…
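To make the notion of reasoning "depth" concrete, the following is a minimal sketch that assumes depth counts the number of chained inference steps and uses only one rule, modus ponens, over single-premise implications. The function name `derivation_depth` and the `(premise, conclusion)` rule encoding are hypothetical choices for illustration; they are not taken from the paper or its repository, and the dataset itself combines many more rules expressed in natural language.

```python
from typing import Dict, List, Set, Tuple

# A rule (premise, conclusion) is read as "premise -> conclusion".
# This encoding is an assumption made for illustration only.
Rule = Tuple[str, str]


def derivation_depth(facts: Set[str], rules: List[Rule], goal: str) -> int:
    """Return the fewest modus ponens steps needed to derive `goal`.

    Returns 0 if `goal` is already a known fact and -1 if it cannot
    be derived from `facts` using `rules`.
    """
    # Known facts need zero inference steps.
    depth: Dict[str, int] = {fact: 0 for fact in facts}

    # Repeat passes over the rules until no depth improves (a fixpoint).
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise not in depth:
                continue
            # Modus ponens: from premise and premise -> conclusion,
            # derive conclusion in one additional step.
            candidate = depth[premise] + 1
            if conclusion not in depth or candidate < depth[conclusion]:
                depth[conclusion] = candidate
                changed = True

    return depth.get(goal, -1)


if __name__ == "__main__":
    chain = [("p", "q"), ("q", "r"), ("r", "s")]
    print(derivation_depth({"p"}, chain, "s"))
```

In this toy setting, deriving `s` from the fact `p` requires three chained applications of modus ponens, so a question asking whether `s` holds would be a depth-3 problem under the stated assumption.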

Code & Models

Repositories

mihir3009/multi-logieval (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies

Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam