Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps
Yen-Che Hsiao, Abhishek Dutta

TL;DR
This paper examines how model size and fine-tuning affect reasoning in decoder-only transformer models, identifies a size threshold for reasoning performance, and uses attention maps to interpret model behavior.
Contribution
It identifies a critical size threshold (~1.6B parameters) beyond which reasoning performance improves significantly, shows that fine-tuning with task-specific exemplars enhances reasoning in smaller models, and uses attention-map analysis for interpretability.
Findings
Models above roughly 1.6B parameters show markedly better reasoning performance, especially on deductive tasks that require longer reasoning chains.
Fine-tuning improves reasoning in smaller models.
Attention maps reveal reasoning-related token attention patterns.
Abstract
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data, including GPT2, SmolLM2, OpenELM, TinyLlama, Stable LM, and Gemma 2. We identify a critical parameter threshold (~1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning. Specifically, models above this threshold achieve better success rates in chain-of-thought (CoT) prompting for deductive reasoning tasks, especially those requiring longer reasoning chains, such as proof by contradiction and disjunction elimination. To address limitations in sub-threshold models, we demonstrate that fine-tuning with task-specific exemplars substantially enhances reasoning performance, enabling accurate CoT generation even…
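For readers unfamiliar with the attention maps mentioned in the findings above, the following is a minimal sketch of how a causal (decoder-only) attention map is computed for a single head with scaled dot-product attention. It is an illustration under stated assumptions, not the authors' analysis pipeline: the function names `causal_attention_map` and `top_attended`, the example tokens, and the toy 2-dimensional vectors are all hypothetical and do not come from the paper.

```python
import math


def causal_attention_map(queries, keys):
    """Return a causal attention map for one head.

    queries, keys: lists of equal-length float vectors, one per token.
    Entry [i][j] is the weight token i places on token j; positions
    j > i (future tokens) are masked to 0, as in a decoder-only model.
    """
    d = len(queries[0])
    attn = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores against the current and earlier tokens only.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        attn.append(row)
    return attn


def top_attended(attn, tokens, query_index):
    """Return the token that receives the most attention from one position."""
    row = attn[query_index]
    j = max(range(len(row)), key=row.__getitem__)
    return tokens[j]


# Toy inputs (assumed for illustration only, not taken from the paper).
tokens = ["A", "implies", "B", "so"]
vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [1.0, 0.0]]
attn = causal_attention_map(vectors, vectors)
print(top_attended(attn, tokens, 3))
```

Each row of the map sums to 1, so inspecting a row shows which earlier tokens a given position weights most heavily. In a real model the query and key vectors would come from a trained attention head rather than being written by hand.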
