Steering Conceptual Bias via Transformer Latent-Subspace Activation

Vansh Sharma; Venkat Raman

arXiv:2506.18887 · cs.AI · June 24, 2025

TL;DR

This paper introduces a gradient-refined activation steering method that biases language-model code generation toward chosen programming languages, improving control and reproducibility.

Contribution

It develops G-ACT (gradient-refined adaptive activation steering), a framework that biases large language models toward a desired programming language with minimal overhead.
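Activation steering in general works by adding a direction vector to a layer's hidden state during the forward pass. The sketch below is a minimal illustration under stated assumptions, not G-ACT itself: it builds one direction as the difference of mean activations between target-language and other-language prompts (a common contrastive construction, assumed here) and adds it with a hypothetical strength `alpha`. The helpers `mean`, `steering_vector` and `steer` are hypothetical names, and plain Python lists stand in for model tensors.

```python
from typing import List

Vector = List[float]  # stand-in for one layer's hidden-state vector


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(target_acts: List[Vector], other_acts: List[Vector]) -> Vector:
    """Mean-difference direction pointing from 'other' activations toward 'target'."""
    target_mean, other_mean = mean(target_acts), mean(other_acts)
    return [t - o for t, o in zip(target_mean, other_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Inject the direction into a hidden state: h' = h + alpha * v."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (hypothetical numbers, not model data).
cpp_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
python_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(cpp_acts, python_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
print(v)        # [2.0, -2.0, 0.0]
print(steered)  # [1.5, -0.5, 0.5]
```

In a real model, `steer` would run inside a forward hook on a chosen layer. Deciding which direction to inject for a given prompt is the part the abstract says G-ACT's per-layer probes take over.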

Findings

01. G-ACT increases probe classification accuracy by 15% in LLaMA-3.2 3B.

02. Targeted injections still strengthen the desired language bias even in models whose attention signals are diffuse.

03. Per-layer probing enables practical and reproducible concept control.
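Per-layer probing can be pictured as one small linear classifier per layer that reads that layer's activation and predicts which steering direction to apply. The sketch below assumes a softmax-regression probe trained by plain SGD, so that "refined online" becomes one `update` call per newly labelled prompt. The class `SoftmaxProbe`, the helper `best_layer`, the learning rate and the toy two-layer data are all hypothetical and not taken from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]


class SoftmaxProbe:
    """Linear probe: one layer's activation -> index of a steering direction."""

    def __init__(self, dim: int, n_classes: int, lr: float = 0.5):
        self.w = [[0.0] * dim for _ in range(n_classes)]
        self.b = [0.0] * n_classes
        self.lr = lr

    def probs(self, x: Vector) -> Vector:
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + bk
                  for row, bk in zip(self.w, self.b)]
        top = max(logits)  # subtract the max logit for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def predict(self, x: Vector) -> int:
        p = self.probs(x)
        return max(range(len(p)), key=p.__getitem__)

    def update(self, x: Vector, label: int) -> None:
        """One online SGD step on the cross-entropy loss for a single example."""
        p = self.probs(x)
        for k in range(len(self.w)):
            grad = p[k] - (1.0 if k == label else 0.0)  # dLoss/dlogit_k
            self.w[k] = [wi - self.lr * grad * xi for wi, xi in zip(self.w[k], x)]
            self.b[k] -= self.lr * grad


def accuracy(probe: SoftmaxProbe, xs: List[Vector], ys: List[int]) -> float:
    return sum(probe.predict(x) == y for x, y in zip(xs, ys)) / len(ys)


def best_layer(acts_by_layer: List[List[Vector]], ys: List[int],
               n_classes: int, epochs: int = 20) -> Tuple[int, List[float]]:
    """Train one probe per layer; return the best layer index and all accuracies."""
    scores = []
    for xs in acts_by_layer:
        probe = SoftmaxProbe(len(xs[0]), n_classes)
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                probe.update(x, y)
        scores.append(accuracy(probe, xs, ys))
    return max(range(len(scores)), key=scores.__getitem__), scores


labels = [0, 1, 0, 1]
layer0 = [[1.0, 1.0]] * 4                                   # uninformative layer
layer1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # separable layer
layer, scores = best_layer([layer0, layer1], labels, n_classes=2)
print(layer, scores)  # 1 [0.5, 1.0]
```

Because every layer gets its own probe, the accuracies also show where in the network the concept is linearly readable; after deployment, calling `update` on each new labelled prompt keeps a probe current without retraining from scratch.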

Abstract

This work examines whether activating latent subspaces in large language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, perturbing the highest-activated MLP weight for a C++ or CPP token, proved brittle and exhibited limited generalization across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation toward the CPP language by increasing the…
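The clustering step described in the abstract can be sketched as follows. The abstract names neither the clustering algorithm nor exactly which two activations are differenced, so this sketch assumes k-means with deterministic farthest-point initialisation, and assumes each per-prompt difference is the activation for a target-language version of a prompt minus the activation for its baseline version. The resulting centroids play the role of the small set of steering directions; `kmeans` and `steering_directions` are hypothetical names.

```python
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> Tuple[List[Vector], List[int]]:
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for each point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def steering_directions(target_acts: List[Vector], base_acts: List[Vector],
                        k: int) -> Tuple[List[Vector], List[int]]:
    """Cluster per-prompt activation differences into k steering directions."""
    diffs = [[t - b for t, b in zip(ta, ba)] for ta, ba in zip(target_acts, base_acts)]
    return kmeans(diffs, k)


# Toy example: four prompts whose differences fall into two groups.
base = [[0.0, 0.0]] * 4
target = [[1.0, 0.0], [1.2, 0.0], [0.0, 1.0], [0.0, 1.2]]
directions, assignment = steering_directions(target, base, k=2)
print(assignment)  # [0, 0, 1, 1]
```

In this picture, a per-layer probe would choose which centroid to inject for a given prompt. The number of directions `k` is a hyperparameter that the abstract only describes as small.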
