Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps

Yen-Che Hsiao; Abhishek Dutta

arXiv:2502.15120·cs.CL·February 24, 2025

TL;DR

This paper studies how model size and fine-tuning affect reasoning in decoder-only transformer language models. It finds a size threshold above which reasoning performance improves significantly and uses attention maps to interpret model behavior.

Contribution

The paper identifies a critical size threshold (~1.6B parameters) beyond which reasoning performance improves significantly, shows that fine-tuning with task-specific exemplars improves reasoning in smaller models, and draws interpretability insights from attention-map analysis.

Findings

01

Models above roughly 1.6B parameters perform significantly better on reasoning tasks.

02

Fine-tuning with task-specific exemplars improves reasoning in sub-threshold models.

03

Attention maps reveal which tokens the models attend to when reasoning.
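
For readers unfamiliar with attention maps, the following is a minimal sketch of how one is computed for a single head of a decoder-only (causal) model. It is not the paper's analysis code: the function names `causal_attention_map` and `top_attended`, the toy token list, and the toy vectors are all assumptions made for illustration. In practice the queries and keys would come from a trained model.

```python
import math


def causal_attention_map(queries, keys):
    """Return causal scaled dot-product attention weights for one head.

    queries, keys: lists of equal-length float vectors, one per token.
    Row i holds the weights token i places on tokens 0..i; later
    positions are masked out and get weight 0.0.
    """
    dim = len(queries[0])
    n = len(queries)
    weights = []
    for i, q in enumerate(queries):
        # Scores against positions 0..i only (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


def top_attended(tokens, weights, position, k=2):
    """Return the k tokens that the token at `position` attends to most."""
    row = weights[position]
    ranked = sorted(range(position + 1), key=lambda j: row[j], reverse=True)
    return [(tokens[j], row[j]) for j in ranked[:k]]


# Toy example: hypothetical tokens and hand-picked 2-d vectors.
tokens = ["If", "A", "then", "B"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
attn = causal_attention_map(vectors, vectors)
for token, row in zip(tokens, attn):
    print(token, [round(w, 3) for w in row])
print(top_attended(tokens, attn, position=3))
```

Each row of the resulting matrix sums to 1, so plotting it as a heatmap shows how a token's attention is distributed over the tokens before it.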

Abstract

This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data, including GPT2, SmolLM2, OpenELM, TinyLlama, Stable LM, and Gemma 2. We identify a critical parameter threshold (~1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning. Specifically, models above this threshold achieve better success rates in chain-of-thought (CoT) prompting for deductive reasoning tasks, especially those requiring longer reasoning chains, such as proof by contradiction and disjunction elimination. To address limitations in sub-threshold models, we demonstrate that fine-tuning with task-specific exemplars substantially enhances reasoning performance, enabling accurate CoT generation even…
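
As a rough illustration of the chain-of-thought prompting the abstract refers to, the sketch below assembles a few-shot CoT prompt for a deductive reasoning question. The prompt layout, the function name `build_cot_prompt`, and the exemplar text are assumptions invented for this example and are not taken from the paper.

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot chain-of-thought prompt.

    exemplars: list of dicts with 'question', 'reasoning', and 'answer' keys.
    question: the new question the model should answer step by step.
    The "Q:/A:" layout is a common convention, assumed here for illustration.
    """
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} So the answer is {ex['answer']}."
        )
    # Leave the final answer open so the model generates its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical exemplar written for this sketch (not from the paper's data).
exemplars = [
    {
        "question": "Every cat is a mammal. Tom is a cat. True or false: Tom is a mammal.",
        "reasoning": "Tom is a cat. Every cat is a mammal. Therefore Tom is a mammal.",
        "answer": "True",
    }
]
prompt = build_cot_prompt(
    exemplars,
    "Every bird is an animal. Rex is a bird. True or false: Rex is an animal.",
)
print(prompt)
```

The same exemplar structure could also serve as supervised training pairs when fine-tuning a smaller model, although the exact data format the authors used is not stated in this summary.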
