Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering

Marco Valentino; Geonhee Kim; Dhairya Dalal; Zhixue Zhao; André Freitas

arXiv:2505.12189·cs.AI·April 2, 2026

TL;DR

This paper studies activation steering techniques, including a novel kNN-based method, for mitigating content biases in the reasoning of large language models and improving their formal reasoning accuracy.

Contribution

The paper introduces a dynamic, fine-grained activation steering approach, notably the K-CAST method, that reduces content biases in LLM reasoning.

Findings

01

Contrastive steering supports linear control over content biases.

02

Static steering alone does not debias every model tested.

03

Conditional steering with K-CAST improves reasoning accuracy by up to 15%.
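
The summary above does not spell out how K-CAST works, so the following is only a generic sketch of what kNN-gated ("conditional") steering can look like, not the authors' method. It assumes a small bank of stored activations labelled "content" or "formal" (hypothetical labels and toy 2-dimensional vectors), and applies a steering direction only when the nearest neighbours of the current hidden state are mostly labelled "content". The names `knn_label`, `conditional_steer`, `bank`, `k`, and `alpha` are all illustrative assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Vector = List[float]


def knn_label(query: Vector, bank: Sequence[Tuple[Vector, str]], k: int) -> str:
    """Return the majority label among the k stored activations closest to query."""
    nearest = sorted(bank, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def conditional_steer(
    hidden: Vector,
    direction: Vector,
    bank: Sequence[Tuple[Vector, str]],
    k: int = 3,
    alpha: float = 1.0,
) -> Vector:
    """Shift hidden along direction only if its neighbours look content-driven."""
    if knn_label(hidden, bank, k) == "content":
        return [h + alpha * d for h, d in zip(hidden, direction)]
    return list(hidden)  # leave the activation untouched otherwise


# Toy bank of labelled activations (made up for illustration only).
bank = [
    ([0.0, 1.0], "content"), ([0.2, 0.9], "content"), ([0.1, 1.2], "content"),
    ([1.0, 0.0], "formal"), ([0.9, 0.1], "formal"), ([1.1, -0.1], "formal"),
]
direction = [1.0, -1.0]  # hypothetical steering direction

steered = conditional_steer([0.25, 1.0], direction, bank)    # neighbours: content
untouched = conditional_steer([1.0, 0.25], direction, bank)  # neighbours: formal
```

The point of the gate is that an intervention which helps biased inputs need not be applied to inputs that are already handled correctly; an odd `k` with two labels avoids tied votes in this toy setup.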

Abstract

Large language models (LLMs) exhibit reasoning biases, often conflating content plausibility with formal logical validity. This can lead to wrong inferences in critical domains, where plausible arguments are incorrectly deemed logically valid or vice versa. This paper investigates how content biases on reasoning can be mitigated through activation steering, an inference-time technique that modulates internal activations. Specifically, after localising the layers responsible for formal and plausible inference, we investigate activation steering on a controlled syllogistic reasoning task, designed to disentangle formal validity from content plausibility. An extensive empirical analysis reveals that contrastive steering methods consistently support linear control over content biases. However, a static approach is insufficient to debias all the tested models. We then investigate how to…
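
To make the idea of contrastive steering concrete, here is a minimal sketch under stated assumptions: the steering direction is taken as the difference between the mean activations of two contrasting sets of examples, and it is added to a hidden state scaled by a coefficient `alpha`. This is a common way to build a contrastive steering vector and is not confirmed by the abstract as the paper's exact procedure. The toy 3-dimensional vectors, the set names, and the helper functions are hypothetical; no real model is involved.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_direction(
    positive: Sequence[Sequence[float]], negative: Sequence[Sequence[float]]
) -> Vector:
    """Difference of class means: mean(positive) - mean(negative)."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift one hidden state along direction; the shift is linear in alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (made up): one set from inputs judged by logical form,
# one set from inputs where plausibility dominated.
formal_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
content_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = contrastive_direction(formal_acts, content_acts)
steered = steer([0.5, 0.5, 0.5], direction, alpha=0.5)
```

Because the shift is `alpha * direction`, doubling `alpha` doubles the displacement and a negative `alpha` pushes the other way, which is the sense in which a single direction can offer linear control. A static scheme of this kind applies the same `alpha` to every input.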
