Generics and Default Reasoning in Large Language Models
James Ravi Kirkpatrick, Rachel Katharine Sterken

TL;DR
This study assesses 28 large language models on their ability to perform defeasible reasoning with generics, revealing strengths, weaknesses, and the impact of different prompting techniques on their reasoning capabilities.
Contribution
It provides a comprehensive evaluation of LLMs' performance on generic reasoning tasks and highlights the effects of prompting styles, especially chain-of-thought prompting, on their reasoning accuracy.
Findings
Performance varies widely across models and prompts.
Few-shot prompting modestly improves some models' performance.
Chain-of-thought prompting often degrades performance (mean accuracy change of -11.14%, SD 15.74%, among models above 75% zero-shot accuracy at temperature 0).
Abstract
This paper evaluates the capabilities of 28 large language models (LLMs) to reason with 20 defeasible reasoning patterns involving generic generalizations (e.g., 'Birds fly', 'Ravens are black') central to non-monotonic logic. Generics are of special interest to linguists, philosophers, logicians, and cognitive scientists because of their complex exception-permitting behaviour and their centrality to default reasoning, cognition, and concept acquisition. We find that while several frontier models handle many default reasoning problems well, performance varies widely across models and prompting styles. Few-shot prompting modestly improves performance for some models, but chain-of-thought (CoT) prompting often leads to serious performance degradation (mean accuracy drop -11.14%, SD 15.74% in models performing above 75% accuracy in zero-shot condition, temperature 0). Most models either…
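As a rough illustration of how a summary statistic like the CoT accuracy drop reported above could be computed, the sketch below filters models by zero-shot accuracy and then takes the mean and standard deviation of the per-model change. The function name, the model names, and all accuracy values are hypothetical and made up for illustration; they are not results from the paper. The use of the sample standard deviation, and of accuracies expressed as percentages, are likewise assumptions.

```python
from statistics import mean, stdev


def cot_accuracy_change(zero_shot, cot, threshold=75.0):
    """Summarise the per-model accuracy change from zero-shot to CoT.

    Only models whose zero-shot accuracy is strictly above `threshold`
    are included. Returns (mean change, sample SD, number of models).
    """
    deltas = [
        cot[name] - acc
        for name, acc in zero_shot.items()
        if acc > threshold and name in cot
    ]
    if len(deltas) < 2:
        raise ValueError("need at least two models above the threshold")
    return mean(deltas), stdev(deltas), len(deltas)


# Hypothetical accuracies in percent (NOT data from the paper).
zero_shot = {"model_a": 90.0, "model_b": 80.0, "model_c": 60.0}
cot = {"model_a": 70.0, "model_b": 78.0, "model_c": 65.0}

mean_change, sd_change, n_models = cot_accuracy_change(zero_shot, cot)
print(f"n={n_models}, mean change={mean_change:.2f}, SD={sd_change:.2f}")
```

Here `model_c` is excluded because its zero-shot accuracy is below the threshold, so a model that improves under CoT from a weak baseline does not offset the degradation measured among the stronger models.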
