Large Language Models Are Not Strong Abstract Reasoners

Gaël Gendron; Qiming Bao; Michael Witbrock; Gillian Dobbie

arXiv:2305.19555 · cs.CL · January 4, 2024 · 1 citation

TL;DR

This paper introduces a new benchmark to evaluate large language models' abstract reasoning abilities, revealing their limited performance and suggesting that guiding their generation along causal paths could enhance their reasoning skills.

Contribution

The paper presents a novel benchmark for assessing LLMs on abstract reasoning, highlighting their current limitations and proposing causal path guidance as a potential improvement.

Findings

1. LLMs perform poorly on abstract reasoning tasks compared to other NLP tasks.
2. Existing techniques do not significantly improve LLM performance on abstract reasoning.
3. Guiding LLMs along causal paths may enhance their reasoning and generalization abilities.

Abstract

Large Language Models have shown tremendous performance on a large variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally circumscribed. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few data. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we introduce a new benchmark for evaluating language models beyond memorization on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs,…

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

- The paper is well written and easy to follow.
- The curated benchmark seems high quality.
- The experiments are extensive and demonstrate the main point.
- The observation that basic techniques do not improve performance is significant.

Weaknesses

- The new benchmark largely consists of existing datasets, so its novelty is limited. Existing work also evaluates the inductive reasoning ability of LLMs, such as https://arxiv.org/pdf/2306.09841.pdf.
- The paper does not evaluate slightly more complicated prompting methods, such as generating more code samples and filtering them by the number of training examples passed. Existing papers proposing more complicated pipelines: https://arxiv.org/pdf/2212.10923.pdf, https://arxiv
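The sample-and-filter idea raised in this review can be sketched as follows. This is a minimal illustration and not the pipeline of the paper or of the cited works: the candidate programs are hand-written stand-ins for code sampled from a model, and the names `count_passed` and `filter_candidates` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[List[int], List[int]]          # (input, expected output)
Program = Callable[[List[int]], List[int]]     # one sampled candidate


def count_passed(program: Program, examples: Sequence[Example]) -> int:
    """Count the training examples the candidate reproduces exactly."""
    passed = 0
    for inputs, expected in examples:
        try:
            if program(list(inputs)) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply fails that example.
            pass
    return passed


def filter_candidates(
    candidates: Sequence[Program], examples: Sequence[Example]
) -> Tuple[List[Program], int]:
    """Keep the candidates that pass the most training examples."""
    scores = [count_passed(c, examples) for c in candidates]
    best = max(scores, default=0)
    survivors = [c for c, s in zip(candidates, scores) if s == best]
    return survivors, best


# Assumption: toy stand-ins for programs a model might have sampled.
candidates: List[Program] = [
    lambda xs: xs[::-1],             # reverse the list
    lambda xs: sorted(xs),           # sort ascending
    lambda xs: [x + 1 for x in xs],  # increment each element
    lambda xs: xs[5],                # buggy: raises IndexError on short lists
]

# Few-shot "training" pairs that define the hidden pattern (sorting).
train: List[Example] = [
    ([3, 1, 2], [1, 2, 3]),
    ([2, 1], [1, 2]),
    ([1, 2, 3], [1, 2, 3]),
]

survivors, best = filter_candidates(candidates, train)
prediction = survivors[0]([9, 7, 8])  # apply a surviving program to a test input
```

In this toy setup only the sorting candidate reproduces all three training pairs, so it is the one applied to the held-out input. A real pipeline would replace the hand-written lambdas with model-generated code and would need sandboxed execution.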

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

1. The paper investigates the abstract reasoning abilities of Large Language Models by creating a new benchmark that combines existing datasets with novel datasets adapted from vision tasks for language models, which has not been extensively studied before.
2. The evaluation is fairly extensive, covering a wide range of models and trying a few techniques beyond simple prompting.
3. The paper is well-written and organized.
4. The proposed task has not yet been solved by LLMs.

Weaknesses

1. This task will be automatically solved when models with better reasoning capabilities become available.
2. The authors frame abstract reasoning as "a potential task for effective measurement of the cognitive abilities of neural models", so the utility of this benchmark is mostly the evaluation of LLMs. One concern is that there isn't an actual application that would benefit from studying this kind of reasoning capability.

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

1. This article attempts to address a topic of great interest - whether large models possess the capacity for abstract reasoning.
2. The authors provide a comprehensive evaluation and conduct extensive experiments on various language models.

Weaknesses

1. Similar conclusions have been explored by previous studies [1][2].
   [1] "Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks." arXiv preprint arXiv:2307.02477 (2023).
   [2] "Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners." arXiv preprint arXiv:2305.14825 (2023).
2. Experiments with larger or more advanced models are lacking. Fine-tuning smaller models is not sufficient to support the conclusion.

Code & Models

Repositories

strong-ai-lab/logical-and-abstract-reasoning (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification