Dialz: A Python Toolkit for Steering Vectors

Zara Siddique; Liam D. Turner; Luis Espinosa-Anke

arXiv:2505.06262·cs.LG·June 4, 2025



TL;DR

Dialz is a Python toolkit that enables researchers to manipulate and analyze language model activations to control concepts like honesty or positivity, improving interpretability and safety.

Contribution

Dialz introduces a modular, user-friendly framework for steering vectors in open-source LLMs, covering contrastive pair dataset creation, steering vector computation and application, and visualization.

Findings

1. Steering vectors built with Dialz can reduce harmful outputs such as stereotypes.

2. Dialz provides insight into model behaviour across different layers.

3. The toolkit supports research into safer and more transparent AI systems.

Abstract

We introduce Dialz, a framework for advancing research on steering vectors for open-source LLMs, implemented in Python. Steering vectors allow users to modify activations at inference time to amplify or weaken a 'concept', e.g. honesty or positivity, providing a more powerful alternative to prompting or fine-tuning. Dialz supports a diverse set of tasks, including creating contrastive pair datasets, computing and applying steering vectors, and visualizations. Unlike existing libraries, Dialz emphasizes modularity and usability, enabling both rapid prototyping and in-depth analysis. We demonstrate how Dialz can be used to reduce harmful outputs such as stereotypes, while also providing insights into model behaviour across different layers. We release Dialz with full documentation, tutorials, and support for popular open-source models to encourage further research in safe and controllable…
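To make the idea in the abstract concrete, the following is a minimal sketch of a steering vector computed from contrastive pairs and applied with a positive or negative scalar. It does not use the Dialz API and is not the authors' implementation. It assumes a difference-of-means vector, which is one common construction; the abstract does not say which methods Dialz uses. The toy activations and the helper names (`steering_vector`, `apply_steering`, `fake_activation`) are hypothetical, standing in for hidden states read from a chosen model layer.

```python
import random


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(pos_acts, neg_acts):
    """Difference of means: positive-example mean minus negative-example mean."""
    mu_pos, mu_neg = mean(pos_acts), mean(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def apply_steering(hidden, vector, scalar):
    """Shift a hidden state along the steering direction.

    A positive scalar amplifies the concept, a negative scalar weakens it.
    """
    return [h + scalar * v for h, v in zip(hidden, vector)]


def dot(a, b):
    """Dot product, used here to measure how strongly a state expresses the concept."""
    return sum(x * y for x, y in zip(a, b))


# Toy data (assumption): the "concept" lies along the first axis of a
# 4-dimensional activation space, and each contrastive pair differs in sign
# along that axis, plus a little Gaussian noise.
rng = random.Random(0)
concept = [1.0, 0.0, 0.0, 0.0]


def fake_activation(sign):
    noise = [rng.gauss(0, 0.1) for _ in concept]
    return [sign * c + e for c, e in zip(concept, noise)]


pos_acts = [fake_activation(+1) for _ in range(32)]  # e.g. "honest" prompts
neg_acts = [fake_activation(-1) for _ in range(32)]  # e.g. "dishonest" prompts

vec = steering_vector(pos_acts, neg_acts)

hidden = [0.0, 0.5, -0.5, 0.2]               # a hidden state at inference time
amplified = apply_steering(hidden, vec, 2.0)
weakened = apply_steering(hidden, vec, -2.0)

print("concept score, amplified:", dot(amplified, concept))
print("concept score, original: ", dot(hidden, concept))
print("concept score, weakened: ", dot(weakened, concept))
```

In a real model the same addition would be applied to the residual-stream activations of one or more layers during the forward pass, which is why the choice of layer matters for the layer-wise analysis the abstract mentions.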



Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning