YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation
Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang

TL;DR
YaPO introduces a sparse steering vector method for LLM alignment that offers faster convergence, better stability, and fine-grained control without degrading general knowledge, and it applies across various alignment tasks.
Contribution
The paper proposes YaPO, a reference-free approach that learns sparse, interpretable steering vectors using a Sparse Autoencoder, improving stability and effectiveness in fine-grained LLM alignment.
Findings
YaPO converges faster than dense baselines.
YaPO achieves stronger alignment performance.
YaPO maintains general knowledge without degradation.
Abstract
Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, limiting their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet another Policy Optimization (YaPO), a reference-free method that learns sparse steering vectors in the latent space of a…
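To make the idea of steering in a Sparse Autoencoder's latent space concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the hand-built SAE weights, the function names (`sae_encode`, `sae_decode`, `steer`), the ReLU applied after adding the steering vector, and the step that adds back the SAE reconstruction error are all assumptions made for illustration. The preference-optimization training that would learn the vector is not shown; the vector `v_sparse` is simply fixed by hand.

```python
# Toy sketch only. The weights, sizes, and function names below are
# hypothetical and are not taken from the YaPO paper or its code.

def relu(xs):
    return [max(0.0, x) for x in xs]

def matvec(rows, xs):
    # Multiply a matrix (given as a list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, xs)) for row in rows]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

# A tiny hand-built SAE: 2-dimensional activations, 4 latent features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_DEC = [0.1, 0.0]  # nonzero bias so the reconstruction is imperfect

def sae_encode(h):
    # Map a dense activation to non-negative latent features.
    return relu(add(matvec(W_ENC, h), B_ENC))

def sae_decode(z):
    # Map latent features back to activation space.
    return add(matvec(W_DEC, z), B_DEC)

def steer(h, v):
    # Assumed intervention: shift the latent code by v, decode, and add
    # back whatever the SAE failed to reconstruct from the original h.
    z = sae_encode(h)
    residual = sub(h, sae_decode(z))
    z_steered = relu(add(z, v))
    return add(sae_decode(z_steered), residual)

h = [0.5, -2.0]
v_sparse = [0.0, 0.0, 0.0, 1.5]  # only one latent feature is touched

print(steer(h, [0.0] * 4))  # a zero vector leaves h unchanged (up to rounding)
print(steer(h, v_sparse))   # only the direction tied to feature 3 moves
```

In this toy setup, a steering vector with a single nonzero entry changes the activation only along the decoder direction of that one feature, which is the intuition behind preferring a sparse latent vector over a dense one.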
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The method converges much faster than BiPO and outperforms it on the cultural benchmark.
* They also evaluate their method on BiPO's benchmarks, but only on the hallucination dataset (which they note in their Limitations section).
* The work was done only on a single 2B model. The 9B variant was mentioned once in Limitations with no further details in the main body or Appendix.
* The paper claims to produce more interpretable steering directions, but does no work on interpreting the steering direction. They note that this is "beyond the scope of this paper" in the Limitations, but I disagree, as merely using the sparse autoencoder feature basis is not sufficient to make things more interpretable.
* While the dat
+ It is good to see more research on joining SAE and steering vectors.
+ The cultural localization problem that the authors put forward and gathered a dataset for is interesting and can be a good addition to existing tasks for benchmarking model behavior manipulation.
+ The idea of bridging SAE and steering vectors is not exactly new. For instance, [1] and [2] have both investigated how sparsity/monosemanticity helps regularize representation steering. In a way, YaPO can be considered as merely using BiPO to achieve [2].
+ While BiPO is a very good paper to build on, using it as the only baseline is inadequate, given that there are existing works that share the same design, as mentioned above.
+ The experiments are also limited.
+ Gemma is the only model be
- First method to combine preference optimization with sparse steering vectors in SAE latent space, addressing limitations of both dense steering (BiPO) and static sparse methods (SAS)
- Demonstrates order-of-magnitude faster convergence and consistent performance improvements across all evaluated languages and settings
- Curates a high-quality multilingual dataset (45,354 items) with careful controls for dialect, cultural validity, and localized/non-localized variants
- Introduces PNLG and R
- My biggest concern with this paper is the lack of baselines regarding steering with SAE. The authors did not compare against some new baselines like ReFT-r1, RePS, HyperSteer, and EasyEdit2. Since these methods also leverage SAE-based representations for steering, this omission makes it difficult to assess whether YaPO's improvements are genuinely novel.
- I am a little bit concerned about the limited model coverage. YaPO is only evaluated on Gemma-2-2B (briefly mentions Gemma-2-9B), lacking
1. The proposed approach is intuitive and easy to understand.
2. In cultural adaptation tasks, it achieves significant improvements over the baseline method, BiPO.
1. The experiments were only conducted on Gemma2-2B, and the results need to be validated on more models to demonstrate generality.
2. The baselines are limited. The paper only compares against BiPO, while there are many existing works on sparse activation steering that should be included for a more comprehensive comparison.
3. The tasks are restricted to cultural adaptation, and although the authors created their own dataset, the description of the task is vague. It is difficult to understand
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
