SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning
Wei Xia, Zhi-Hong Deng

TL;DR
SDA is a training-free, model-agnostic framework that dynamically aligns open-source LLM outputs with human intent during inference, improving helpfulness, harmlessness, and honesty without retraining.
Contribution
SDA introduces a novel, lightweight, and resource-efficient method for aligning LLM behavior with human preferences during inference without fine-tuning.
Findings
Achieves 64.4% improvement in helpfulness
Increases honesty by 30%
Enhances harmlessness by 11.5% across 8 open-source LLMs
Abstract
With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior…
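The abstract is cut off before it states how SDA redistributes output probabilities, so the sketch below is only an illustration of the general idea, not the authors' formula. It assumes a contrastive-style rule: next-token logits are computed once without and once with the alignment instruction in the prompt, and the difference is scaled by a strength parameter. The names `steer_distribution`, `base_logits`, `instructed_logits`, and `alpha` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_distribution(base_logits, instructed_logits, alpha=1.5):
    """Shift next-token probabilities toward the instructed behaviour.

    ASSUMPTION: this contrastive rule is a generic illustration and is
    not taken from the SDA paper.
      alpha = 0 -> the base distribution (no steering)
      alpha = 1 -> the instructed distribution
      alpha > 1 -> the instruction's effect is amplified
    """
    if len(base_logits) != len(instructed_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    adjusted = [
        b + alpha * (i - b)
        for b, i in zip(base_logits, instructed_logits)
    ]
    return softmax(adjusted)


# Toy three-token vocabulary: the instruction favours token 1 over token 0.
base = [2.0, 1.0, 0.5]
instructed = [1.0, 2.0, 0.5]
probs = steer_distribution(base, instructed, alpha=1.5)
```

In this toy setup no weights are updated: the only change happens at decoding time, which is what "training-free" refers to. With `alpha` above 1, token 1 receives more probability mass than it does under the instructed logits alone.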
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques
