Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4
Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, Yue Zhang

TL;DR
This paper evaluates the logical reasoning capabilities of ChatGPT and GPT-4 across various datasets, revealing high performance on known benchmarks but significant challenges with out-of-distribution data and NLI tasks.
Contribution
It introduces LogiEval, a new benchmark suite for logical reasoning, and provides a comprehensive comparison of ChatGPT and GPT-4's reasoning abilities.
Findings
GPT-4 outperforms ChatGPT on most benchmarks
Performance drops significantly on out-of-distribution datasets
Logical reasoning remains challenging for both models
Abstract
Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), highlighted as "advanced" at reasoning tasks, we are eager to learn the GPT-4 performance on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, with popular benchmarks like LogiQA and ReClor, and newly-released datasets like AR-LSAT. We test the multi-choice reading comprehension and natural language inference tasks with benchmarks requiring logical reasoning. We further construct a logical reasoning out-of-distribution dataset to investigate the robustness of ChatGPT and GPT-4. We also make a performance comparison between ChatGPT and GPT-4. Experiment results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
Methods: Attention Is All You Need · Weight Decay · WordPiece · Linear Layer · Linear Warmup With Linear Decay · Attention Dropout · BERT · Multi-Head Attention
