ExpertSteer: Intervening in LLMs through Expert Knowledge
Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch

TL;DR
ExpertSteer introduces a novel activation steering method that leverages external expert models to guide large language models during inference, enabling flexible and effective intervention without retraining.
Contribution
It proposes a new approach to steer LLMs using external expert models, overcoming limitations of existing methods that rely on model-internal vectors.
Findings
Outperforms baseline methods across 15 benchmarks
Effective intervention across three different LLMs
Minimal additional computational cost
Abstract
Large Language Models (LLMs) exhibit remarkable capabilities across various tasks, yet guiding them to follow desired behaviours during inference remains a significant challenge. Activation steering offers a promising method to control the generation process of LLMs by modifying their internal activations. However, existing methods commonly intervene in the model's behaviour using steering vectors generated by the model itself, which constrains their effectiveness to that specific model and excludes the possibility of leveraging powerful external expert models for steering. To address these limitations, we propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors, enabling intervention in any LLM. ExpertSteer transfers the knowledge from an expert model to a target LLM through a cohesive four-step process: first aligning…
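As a rough illustration of what "modifying internal activations" means, here is a minimal sketch of generic activation steering with a difference-of-means vector. This is a common baseline formulation and not ExpertSteer's procedure, which derives its vectors from an external expert model. The function names, the toy activations, and the strength `alpha` are all hypothetical, chosen only for this example.

```python
import math

def mean_difference_vector(pos_acts, neg_acts):
    """Unit-norm direction from the mean of negative to the mean of positive activations.

    Assumption: each argument is a list of equal-length activation vectors
    collected at one layer for examples with / without the desired behaviour.
    """
    dim = len(pos_acts[0])
    mean_pos = [sum(a[i] for a in pos_acts) / len(pos_acts) for i in range(dim)]
    mean_neg = [sum(a[i] for a in neg_acts) / len(neg_acts) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def steer(hidden_states, vector, alpha):
    """Add alpha * vector to every hidden state (one state per token position)."""
    return [[h + alpha * v for h, v in zip(state, vector)] for state in hidden_states]

# Toy data: the two groups differ only along the first dimension.
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
vector = mean_difference_vector(pos_acts, neg_acts)

hidden_states = [[0.5, -1.0, 2.0], [0.0, 0.0, 0.0]]
steered = steer(hidden_states, vector, alpha=4.0)
```

In a real model the same addition would be applied to a chosen layer's hidden states during the forward pass. Which layer to use and how large `alpha` should be are design choices that this sketch does not address.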
Peer Reviews
Decision: Submitted to ICLR 2026
- Introduces a generalizable and model-agnostic steering framework that integrates external expert knowledge.
- Comprehensive experiments across multiple domains and model families support the proposed method’s effectiveness.
- Demonstrates both same-family and cross-family transfer, showing robustness.
- Computationally efficient with negligible inference overhead.
- Clear methodology and detailed ablation studies on design choices (RFMs, alignment order, expert selection).

- Limited dataset diversity. Several benchmarks such as MMLU variants are overused, which may overstate generalization.
- Scalability concerns remain, as only small-to-medium LLMs are tested. It is unclear how EXPERTSTEER performs for much larger models.
- The mutual information-based layer pairing and RFM feature extraction steps, while intuitive, would benefit from stronger theoretical or empirical justification.
- The paper does not analyze potential risks of applying expert interventions
1. The method provides a new way to choose layers for steering using a Mutual Information criterion, and the idea of matching layers between expert and target models is very interesting.
2. The authors introduce a novel idea of calculating the steering direction as the direction capturing the most variation in the discriminative feature space.
3. Diverse empirical experiments as well as detailed ablation studies.
1. The auto-encoder is trained on the expert model's feature space with a reconstruction loss only, so nothing regularizes the auto-encoder's hidden state space to align with the feature space of the target model. According to the paper, the authors use only an affine layer for the encoder/decoder; this allows for spurious hidden feature spaces that may not align with the target model at all: if the expert model's feature space dimension ($d_E$) is smaller
1. Breaks model dependency in activation steering, enables cross-model knowledge transfer. ExpertSteer introduces external expert models to generate steering vectors. It not only injects domain-specific knowledge absent in the target model, but also aligns feature dimensions via auto-encoders and matches intervention layers using mutual information analysis.
2. Accurately captures expert knowledge, enhances steering vector effectiveness. Traditional methods often rely on linear feature extract
1. The effectiveness of EXPERTSTEER heavily depends on the expertise of the expert model and the quality of training data. If the expert model lacks domain knowledge (e.g., using a general-purpose model instead of a domain expert) or the training data contains mislabeled samples (e.g., incorrect medical classifications), the steering vector’s effectiveness may be significantly reduced. Since the experiments show that using a general model (e.g., Llama-3-8B) to generate steering vectors only yields
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
