Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

Leon Eshuijs; Shihan Wang; Antske Fokkens

arXiv:2505.06032·cs.LG·May 12, 2025

TL;DR

This paper investigates how language models process shortcuts in text classification. It finds that specific attention heads focus on shortcut tokens and push the model towards a label before the full input is processed, and it introduces Head-based Token Attribution (HTA) to detect and mitigate these shortcuts.

Contribution

The paper uncovers the internal mechanisms by which models process shortcuts and proposes HTA for targeted shortcut detection and mitigation.

Findings

01 Attention heads focus on shortcuts early in processing

02 HTA effectively detects shortcuts in large language models

03 Selective deactivation of shortcut heads mitigates shortcut reliance
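
The third finding can be illustrated with a toy example. The sketch below is not the authors' code: the tokens, the per-token evidence values, the two head names, and the attention weights are all made-up assumptions. It only shows the general idea of head ablation: remove one head's contribution and check whether the prediction changes.

```python
# Toy illustration of head ablation. All numbers and names are hypothetical.
tokens = ["a", "dull", "film", "with", "ActorX"]

# Assumed per-token evidence for the "positive" label (made up).
token_evidence = [0.0, -1.0, 0.0, 0.0, 3.0]

# Assumed attention weights of two heads, read from the final position.
attention = {
    "context_head": [0.1, 0.6, 0.2, 0.1, 0.0],
    "shortcut_head": [0.0, 0.05, 0.05, 0.0, 0.9],
}


def head_contribution(weights, evidence):
    """Attention-weighted sum of token evidence for one head."""
    return sum(w * e for w, e in zip(weights, evidence))


def positive_logit(attention, evidence, ablated=()):
    """Sum the contributions of all heads that are not ablated."""
    return sum(
        head_contribution(weights, evidence)
        for name, weights in attention.items()
        if name not in ablated
    )


full = positive_logit(attention, token_evidence)
without_shortcut = positive_logit(
    attention, token_evidence, ablated=("shortcut_head",)
)

print("full logit:", full)
print("logit without shortcut head:", without_shortcut)
```

In this toy setup the full model leans positive because one head attends almost entirely to the actor name, while the ablated model follows the word "dull" and leans negative. A real model would require hooking into the attention layers of the network rather than summing hand-written numbers.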

Abstract

Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively…
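
The abstract says only that HTA traces intermediate decisions back to input tokens; it does not give the formula here. As a rough illustration of what tracing a head's output back to tokens can look like, the sketch below uses a generic decomposition that is an assumption, not the paper's definition. For a single attention head at one position, the output is a weighted sum of value vectors, so its projection onto a label direction splits exactly into one term per token. The function name, vectors, and weights are hypothetical.

```python
# Generic attention-weighted token attribution for one head.
# This is NOT the paper's HTA definition; every value here is made up.


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attribute_head_to_tokens(attn_weights, value_vectors, label_direction):
    """Per-token share of the head output projected onto the label direction.

    The head output is sum_t attn[t] * value[t], so the projection is the
    sum of attn[t] * dot(value[t], label_direction) over all tokens t.
    """
    return [
        w * dot(v, label_direction)
        for w, v in zip(attn_weights, value_vectors)
    ]


tokens = ["great", "cast", "led", "by", "ActorX"]
attn_weights = [0.1, 0.05, 0.05, 0.0, 0.8]           # assumed attention pattern
value_vectors = [                                     # assumed 2-d value vectors
    [1.0, 0.0],
    [0.2, 0.1],
    [0.0, 0.0],
    [0.0, 0.0],
    [2.0, -0.5],
]
label_direction = [1.0, -1.0]                         # assumed "positive" direction

scores = attribute_head_to_tokens(attn_weights, value_vectors, label_direction)
top_index = max(range(len(tokens)), key=lambda i: scores[i])
top_token = tokens[top_index]

for token, score in zip(tokens, scores):
    print(f"{token:>8}: {score:+.3f}")
print("highest attribution:", top_token)
```

Under these assumed numbers the actor name receives by far the largest score, which is the kind of signal a shortcut detector would flag.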

Code & Models

Repositories

watermeleon/shortcut_mechanisms
jax · Official

Videos

Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification · Underline

Taxonomy

Topics: Topic Modeling · Authorship Attribution and Profiling · Explainable Artificial Intelligence (XAI)

Methods: Softmax · Attention Is All You Need · Focus