Disposition Distillation at Small Scale: A Three-Arc Negative Result

Hari Sadasivan (Tinman Lab)

arXiv:2604.11867·cs.LG·April 15, 2026

TL;DR

This study attempted to instill behavioral dispositions in small language models through distillation and inference-time interventions, but found no method that improved disposition without harming content quality.

Contribution

The paper reports a negative result across three experimental arcs, showing how difficult it is to modify dispositions in small language models, and introduces a taxonomy of failure modes.

Findings

01

A second-pass sanity check falsified the initially reported MCAS and HumanEval gains.

02

No intervention improved dispositions without content damage.

03

Gemma 4 E2B shows confidence-correctness decoupling.

Abstract

We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base…
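
The truncation artifact described in the abstract suggests a general sanity check: report how many completions hit the generation cap alongside any pass-rate delta. The following is a minimal sketch of that idea, not the authors' evaluation code. The `Completion` record, the helper names, and the toy data are all hypothetical and chosen only for illustration; the only detail taken from the abstract is the cap value `n_predict=512`.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical record for one scored generation."""
    n_tokens: int  # tokens actually generated
    passed: bool   # whether the completion passed its tests


def pass_rate(completions):
    """Percentage of completions that passed."""
    return 100.0 * sum(c.passed for c in completions) / len(completions)


def truncation_rate(completions, n_predict):
    """Percentage of completions that reached the generation cap."""
    hit_cap = sum(c.n_tokens >= n_predict for c in completions)
    return 100.0 * hit_cap / len(completions)


def delta_report(base, student, n_predict):
    """Pass-rate delta (student minus base) with truncation rates for both."""
    return {
        "delta": pass_rate(student) - pass_rate(base),
        "base_truncated": truncation_rate(base, n_predict),
        "student_truncated": truncation_rate(student, n_predict),
    }


# Toy data (illustrative only, not results from the paper): the base model
# hits the cap more often than the student, so part of the apparent gain
# may reflect truncation rather than capability.
base = [Completion(512, False), Completion(512, False),
        Completion(300, True), Completion(200, True)]
student = [Completion(150, True), Completion(180, True),
           Completion(220, True), Completion(512, False)]

report = delta_report(base, student, n_predict=512)
print(report)
```

When the two truncation rates differ substantially, a delta measured at that cap is worth re-measuring at a larger generation budget before it is reported.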
