Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions

Yuntai Bao; Xuhong Zhang; Jintao Chen; Ge Su; Yuxiang Cai; Hao Peng; Bing Sun; Haiqin Weng; Liu Yan; Jianwei Yin

arXiv:2602.05234·cs.LG·March 17, 2026

TL;DR

This paper introduces Concept DAS (CDAS), a novel intervention-based model steering method that aligns internal model mechanisms with desired outputs using distribution matching, enabling more faithful and stable control.

Contribution

The paper proposes a new distribution matching objective and bi-directional interventions for model steering, improving faithfulness and reducing hyperparameter tuning effort.

Findings

01 CDAS performs well on the AxBench benchmark.

02 It effectively overrides safety refusal behaviors.

03 It maintains model utility in safety-related case studies.

Abstract

Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS…
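
The two ingredients named in the abstract can be illustrated with a toy sketch. It shows a rank-1 distributed interchange intervention (replacing the base representation's coordinate along one direction with the source's coordinate) followed by a divergence between the counterfactual and intervened output distributions. Everything here is an assumption for illustration: the function names, the toy linear readout `W`, the fixed direction, and the use of KL divergence are hypothetical and are not taken from the paper, whose exact objective is not given in this excerpt.

```python
import math

def interchange(base, source, direction):
    """Rank-1 interchange intervention: overwrite base's coordinate
    along `direction` with source's coordinate along it."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Difference between source and base, projected onto the direction.
    delta = sum((s - b) * u for s, b, u in zip(source, base, unit))
    return [b + delta * u for b, u in zip(base, unit)]

def readout(hidden, weights):
    """Toy linear unembedding (assumption): one logit per weight row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy values: 2-d hidden states, 3-token vocabulary.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base = [0.5, -1.0]       # hidden state without the concept
source = [2.0, 0.3]      # hidden state from the concept-induced input
direction = [1.0, 0.0]   # steering direction (learned in practice, fixed here)

patched = interchange(base, source, direction)
p_counterfactual = softmax(readout(source, W))
p_intervened = softmax(readout(patched, W))

# Distribution matching loss: how far the intervened output distribution
# is from the counterfactual one. KL is an assumed choice of divergence.
loss = kl_divergence(p_counterfactual, p_intervened)
print(patched, loss)
```

In this sketch, swapping the roles of `base` and `source` in `interchange` gives the intervention in the opposite direction, which is one plausible reading of "bi-directional"; in a real setting the direction would be optimized to minimize the loss rather than fixed by hand.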

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The CDAS objective encourages the model to learn concepts that are aligned with the model’s overall output distribution under the concept-induced input. Consequently, supervision does not come directly from ground-truth responses, but rather from the model’s own internal distribution. This is an interesting idea, as it may lead to outputs that are more naturally aligned with the inherent responses of LLMs. In the refusal override experiments, CDAS achieves the best KL divergence loss, while mai

Weaknesses

While the premise behind CDAS and its training objective is compelling, the results are mixed. For example, in the experiments presented in Table 1, CDAS achieves the best performance on Gemma-2-9B L20 under a tuned factor, outperforming all other methods. However, on other intervention layers and with smaller models (e.g., 2B), CDAS fails to surpass RePS, although it still outperforms DiM, BiPO, and, in two cases, Lang. In the refusal override experiments, CDAS also underperforms on the smaller

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The paper tackles a compelling and timely problem, presenting a solution that is both concise and elegant. The manuscript is well-written and structured, making the methodology and results accessible. The authors conduct extensive experiments to validate CDAS, providing thorough comparisons with existing approaches. Detailed experimental protocols and results are available in the supplemental material, enhancing transparency and reproducibility.

Weaknesses

Despite its merits, there are a few aspects that require clarification or further analysis: (1) In certain experiments, CDAS underperforms relative to baselines (e.g., Tables 1 and 3). The authors should provide insights or hypotheses explaining these performance gaps. (2) It remains unclear under which conditions CDAS excels and under which scenarios it may fall short. A discussion of the limitations and situational strengths of the method would strengthen the paper.

Reviewer 03 · Rating 4 · Confidence 1

Strengths

- Interesting conceptual shift linking steering with causal localization and interpretability.
- Comprehensive experiments across benchmarks and safety settings.

Weaknesses

I am very much an outsider to the “model steering” field; unfortunately, however, this paper does a weak job of presenting the much-needed context for new readers to appreciate the why and how of the manuscript. Much of the structure and writing assumes readers are familiar with extant work and understand its shortcomings, e.g.:
* [l42/46] how does “intervention-based” result in “optimization-based”?
* [l52] what does “degenerate, repetitive generations” even mean?
* [l55] why should the reade

Code & Models

Datasets

colored-dye/concept500_contrastive
dataset · 19 dl

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Advanced Causal Inference Techniques · Explainable Artificial Intelligence (XAI)