Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
Leon Eshuijs, Shihan Wang, Antske Fokkens

TL;DR
This paper investigates how language models process shortcuts in text classification. It finds that specific attention heads attend to shortcut tokens and steer the model toward a label before the full input has been processed, and it introduces Head-based Token Attribution (HTA) to detect and mitigate these shortcuts.
Contribution
The paper characterizes how shortcuts are processed inside the model and proposes HTA for targeted shortcut detection and mitigation.
Findings
Attention heads focus on shortcuts early in processing
HTA effectively detects shortcuts in large language models
Selective deactivation of shortcut heads mitigates shortcut reliance
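To make "selective deactivation" of attention heads concrete, here is a minimal sketch of zero-ablating individual heads in a toy multi-head self-attention layer. It is an illustration only, not the paper's implementation: the function name `multi_head_attention`, the `head_mask` argument, the random weights, and the choice of zero-ablation (rather than, say, mean-ablation) are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, head_mask=None):
    """Toy self-attention layer with optional per-head zero-ablation.

    x:         (seq_len, d_model) token representations
    w_*:       (d_model, d_model) projection matrices (hypothetical weights)
    head_mask: length-n_heads sequence of 0/1; 0 deactivates that head
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)   # (n_heads, seq_len, seq_len)
    per_head = attn @ v               # (n_heads, seq_len, d_head)

    if head_mask is not None:
        # Zero out the output of every head whose mask entry is 0.
        mask = np.asarray(head_mask, dtype=per_head.dtype)
        per_head = per_head * mask[:, None, None]

    # Concatenate heads and apply the output projection.
    merged = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o


# Small demo with random inputs: deactivate head 2 out of 4.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
weights = [rng.normal(size=(8, 8)) for _ in range(4)]
full_output = multi_head_attention(tokens, *weights, n_heads=4)
ablated_output = multi_head_attention(
    tokens, *weights, n_heads=4, head_mask=[1, 1, 0, 1]
)
```

Because the layer output is a sum of per-head contributions, masking a head removes exactly that head's contribution and leaves the others unchanged. In a real model one would compare the classifier's prediction with and without the suspected shortcut heads.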
Abstract
Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively…
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Explainable Artificial Intelligence (XAI)
Methods: Softmax · Attention Is All You Need · Focus
