TL;DR
This paper revisits the claim that large language models (LLMs) are not "abstract reasoners". It shows that, although zero-shot performance is poor, tuning a small subset of input-encoding parameters can yield near-perfect performance. That finetuning does not necessarily transfer across datasets, which prompts a reexamination of what counts as abstract reasoning.
Contribution
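It demonstrates that tuning only a small subset of parameters, those used for input encoding, can take LLMs from poor zero-shot performance to near-perfect performance on challenging reasoning tasks. It also shows that this finetuning does not necessarily transfer across datasets, adding nuance to earlier claims about LLMs' reasoning capabilities.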
Findings
LLMs perform poorly in the zero-shot setting, but tuning a small subset of input-encoding parameters enables near-perfect performance.
This finetuning does not necessarily transfer across datasets.
The results invite a renewed discussion of what it means to be an "abstract reasoner" and why it matters whether LLMs qualify.
Abstract
Recent work has argued that large language models (LLMs) are not "abstract reasoners", citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an "abstract reasoner", and why it matters whether LLMs fit the bill.
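The idea of tuning only the input encoding while leaving the rest of a model untouched can be illustrated with a toy sketch. This is not the authors' setup: the model here is a single frozen linear readout over a small per-token embedding table, and every name (`train_input_encoding`, `mean_squared_error`, `frozen_readout`) is hypothetical. It only shows the mechanics of updating the input-side parameters while the downstream weights stay fixed.

```python
import random


def train_input_encoding(labels, frozen_readout, steps=200, lr=0.1, seed=0):
    """Tune only a per-token embedding table; the readout is never updated."""
    rng = random.Random(seed)
    dim = len(frozen_readout)
    # Trainable input encoding: one small vector per token id.
    embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in labels]
    for _ in range(steps):
        for token, target in enumerate(labels):
            vec = embeddings[token]
            pred = sum(w * x for w, x in zip(frozen_readout, vec))
            err = pred - target
            # Gradient of 0.5 * err**2 w.r.t. the embedding is err * readout.
            for i, w in enumerate(frozen_readout):
                vec[i] -= lr * err * w
    return embeddings


def mean_squared_error(embeddings, labels, frozen_readout):
    """Average squared error of the frozen readout applied to the embeddings."""
    total = 0.0
    for vec, target in zip(embeddings, labels):
        pred = sum(w * x for w, x in zip(frozen_readout, vec))
        total += (pred - target) ** 2
    return total / len(labels)


if __name__ == "__main__":
    readout = [0.5, -1.0, 0.25]       # stands in for the frozen model body
    labels = [1.0, -1.0, 1.0, -1.0]   # one toy target per token id
    tuned = train_input_encoding(labels, readout)
    print("readout unchanged:", readout == [0.5, -1.0, 0.25])
    print("training error:", mean_squared_error(tuned, labels, readout))
```

A sketch like this says nothing about transfer: the embeddings are fitted to one set of labels, so a different dataset would need its own tuned table, which loosely mirrors the limited cross-dataset transfer described in the abstract.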