ExpertSteer: Intervening in LLMs through Expert Knowledge

Weixuan Wang; Minghao Wu; Barry Haddow; Alexandra Birch

arXiv:2505.12313·cs.CL·September 29, 2025

TL;DR

ExpertSteer introduces a novel activation steering method that leverages external expert models to guide large language models during inference, enabling flexible and effective intervention without retraining.

Contribution

It proposes a new approach to steer LLMs using external expert models, overcoming limitations of existing methods that rely on model-internal vectors.

Findings

1. Outperforms baseline methods across 15 benchmarks
2. Effective intervention across three different LLMs
3. Minimal additional computational cost

Abstract

Large Language Models (LLMs) exhibit remarkable capabilities across various tasks, yet guiding them to follow desired behaviours during inference remains a significant challenge. Activation steering offers a promising method to control the generation process of LLMs by modifying their internal activations. However, existing methods commonly intervene in the model's behaviour using steering vectors generated by the model itself, which constrains their effectiveness to that specific model and excludes the possibility of leveraging powerful external expert models for steering. To address these limitations, we propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors, enabling intervention in any LLMs. ExpertSteer transfers the knowledge from an expert model to a target LLM through a cohesive four-step process: first aligning…
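The abstract describes activation steering as modifying a model's internal activations. A minimal sketch of that generic idea follows, assuming the simplest additive form: a unit-normalised vector, scaled by a strength `alpha`, is added to every token's hidden state at one layer. The function name `steer`, the parameter `alpha`, and the toy shapes are illustrative assumptions, not taken from the paper. The sketch does not reproduce ExpertSteer's four-step process, which the truncated abstract does not fully specify.

```python
import numpy as np


def steer(hidden, direction, alpha=1.0):
    """Shift each token's hidden state along a unit-normalised direction.

    hidden:    array of shape (tokens, hidden_dim), activations at one layer
    direction: array of shape (hidden_dim,), the steering vector (hypothetical)
    alpha:     scalar steering strength (hypothetical hyperparameter)
    """
    unit = direction / np.linalg.norm(direction)
    # Broadcasting adds the same shift to every token position.
    return hidden + alpha * unit


# Toy data standing in for real model activations.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # 4 tokens, hidden size 8
direction = rng.normal(size=8)     # placeholder steering vector

steered = steer(hidden, direction, alpha=2.0)
```

In this additive form, each token moves by exactly `alpha` along the steering direction, so `alpha=0.0` leaves the activations unchanged. In a real model the same operation would typically be applied inside a forward hook on the chosen layer.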

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Introduces a generalizable and model-agnostic steering framework that integrates external expert knowledge.
- Comprehensive experiments across multiple domains and model families support the proposed method’s effectiveness.
- Demonstrates both same-family and cross-family transfer, showing robustness.
- Computationally efficient with negligible inference overhead.
- Clear methodology and detailed ablation studies on design choices (RFMs, alignment order, expert selection).

Weaknesses

- Limited dataset diversity. Several benchmarks such as MMLU variants are overused, which may overstate generalization.
- Scalability concerns remain, as only small-to-medium LLMs are tested. It is unclear how EXPERTSTEER performs for much larger models.
- The mutual information-based layer pairing and RFM feature extraction steps, while intuitive, would benefit from stronger theoretical or empirical justification.
- The paper does not analyze potential risks of applying expert interventions

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The method provides a new way to choose layers for steering using a mutual information criterion, and the idea of matching layers between expert and target models is very interesting.
2. The authors introduce a novel idea of calculating the steering direction as the direction capturing the most variation in the discriminative feature space.
3. Diverse empirical experiments as well as detailed ablation studies.

Weaknesses

1. The auto-encoder is trained on the expert model's feature space using only a reconstruction loss, which lacks the regularization needed for the auto-encoder's hidden state space to align with the feature space of the target model. As written in the paper, the authors use only an affine layer for the encoder/decoder; this allows for spurious hidden feature spaces that may not align with the target model at all: if the expert model's feature space dimension ($d_E$) is smaller

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Breaks model dependency in activation steering, enables cross-model knowledge transfer. ExpertSteer introduces external expert models to generate steering vectors. It not only injects domain-specific knowledge absent in the target model, but also aligns feature dimensions via auto-encoders and matches intervention layers using mutual information analysis.
2. Accurately captures expert knowledge, enhances steering vector effectiveness. Traditional methods often rely on linear feature extract

Weaknesses

1. The effectiveness of EXPERTSTEER heavily depends on the expertise of the expert model and the quality of the training data. If the expert model lacks domain knowledge (e.g., using a general-purpose model instead of a domain expert) or the training data contains mislabeled samples (e.g., incorrect medical classifications), the steering vector’s effectiveness may be significantly reduced. Since the experiments show that using a general model (e.g., Llama-3-8B) to generate steering vectors only yields

Code & Models

Repositories

weixuan-wang123/expertsteer · Official

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education