Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4

Hanmeng Liu; Ruoxi Ning; Zhiyang Teng; Jian Liu; Qiji Zhou; Yue Zhang

arXiv:2304.03439·cs.CL·May 8, 2023·104 cites

TL;DR

This paper evaluates the logical reasoning capabilities of ChatGPT and GPT-4 across various datasets, revealing high performance on known benchmarks but significant challenges with out-of-distribution data and NLI tasks.

Contribution

It introduces LogiEval, a new benchmark suite for logical reasoning, and provides a comprehensive comparison of ChatGPT and GPT-4's reasoning abilities.

Findings

01. GPT-4 outperforms ChatGPT on most benchmarks

02. Performance drops significantly on out-of-distribution datasets

03. Logical reasoning remains challenging for both models

Abstract

Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), highlighted as "advanced" at reasoning tasks, we are eager to learn how GPT-4 performs on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, including popular benchmarks such as LogiQA and ReClor and newly released datasets such as AR-LSAT. We test multi-choice reading comprehension and natural language inference tasks using benchmarks that require logical reasoning. We further construct a logical reasoning out-of-distribution dataset to investigate the robustness of ChatGPT and GPT-4. We also compare the performance of ChatGPT and GPT-4. Experiment results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With…

Code & Models

Repositories

csitfun/logieval
Official

Datasets

baber/logiqa2
dataset · 747 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification

Methods: Attention Is All You Need · Weight Decay · WordPiece · Linear Layer · Linear Warmup With Linear Decay · Attention Dropout · BERT · Multi-Head Attention