Steering Conceptual Bias via Transformer Latent-Subspace Activation
Vansh Sharma, Venkat Raman

TL;DR
This paper introduces a gradient-refined activation steering method to bias language model code generation towards specific programming languages, improving control and reproducibility.
Contribution
The paper develops G-ACT, a gradient-refined adaptive activation steering framework that biases large language models towards a desired programming language with minimal overhead.
Findings
G-ACT increases probe classification accuracy by 15% in LLaMA-3.2 3B.
Targeted injections improve language bias even in diffuse attention models.
Per-layer probing enables practical and reproducible concept control.
Abstract
This work examines whether activating latent subspaces in large language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, which perturbs the most highly activated MLP weight for a C++ or CPP token, proved brittle and generalized poorly across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably biases generation towards the CPP language by increasing the…
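The pipeline the abstract describes (cluster per-prompt activation differences into a few steering directions, then train a probe that picks which direction to inject) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It uses synthetic vectors in place of real transformer activations, a single layer instead of per-layer probes, plain k-means, and a softmax probe trained offline with no online refinement. All names (`kmeans`, `train_probe`, `steer`, `alpha`) and all sizes are hypothetical choices made for illustration.

```python
import numpy as np

def kmeans(x, k, rng, iters=20):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters unchanged
                centroids[j] = x[labels == j].mean(0)
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dists.argmin(1)

def train_probe(h, labels, k, lr=0.1, steps=300):
    """Softmax-regression probe: predicts a cluster id from an activation."""
    n, d = h.shape
    w, b = np.zeros((d, k)), np.zeros(k)
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        logits = h @ w + b
        logits = logits - logits.max(1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(1, keepdims=True)
        grad = (p - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * (h.T @ grad)
        b -= lr * grad.sum(0)
    return w, b

def steer(h, w, b, directions, alpha):
    """Add the probe-selected steering direction, scaled by alpha."""
    choice = (h @ w + b).argmax(1)
    return h + alpha * directions[choice], choice

# Synthetic stand-in for one layer's activations (assumption: toy data only).
rng = np.random.default_rng(0)
n, d, k, alpha = 200, 16, 2, 4.0
style = rng.integers(0, k, size=n)                 # hypothetical prompt style
style_mean = 2.0 * rng.normal(size=(k, d))
true_dir = 3.0 * rng.normal(size=(k, d))
h_base = style_mean[style] + 0.1 * rng.normal(size=(n, d))            # baseline outputs
h_target = h_base + true_dir[style] + 0.1 * rng.normal(size=(n, d))   # target-language outputs

# 1) Cluster per-prompt activation differences into k steering directions.
centroids, cluster = kmeans(h_target - h_base, k, rng)
directions = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

# 2) Train a probe that selects a direction from the unsteered activation.
w, b = train_probe(h_base, cluster, k)

# 3) Inject the selected direction into the activation.
h_steered, choice = steer(h_base, w, b, directions, alpha)
```

In a real model, `h_base` would be hidden states captured at a chosen layer and the steered activation would be written back during the forward pass; the sketch only shows the arithmetic of direction selection and injection.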
Taxonomy
Methods: Sparse Evolutionary Training
