Dialz: A Python Toolkit for Steering Vectors
Zara Siddique, Liam D. Turner, Luis Espinosa-Anke

TL;DR
Dialz is a Python toolkit that enables researchers to manipulate and analyze language model activations to control concepts like honesty or positivity, improving interpretability and safety.
Contribution
It introduces a modular, user-friendly framework for steering vectors in open-source LLMs, supporting diverse tasks and enhancing research efficiency.
Findings
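Shows that steering vectors built with Dialz can reduce harmful outputs such as stereotypes.
Provides insights into model behaviour across different layers.
Is released with documentation, tutorials, and support for popular open-source models to encourage further research on safety and controllability.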
Abstract
We introduce Dialz, a framework for advancing research on steering vectors for open-source LLMs, implemented in Python. Steering vectors allow users to modify activations at inference time to amplify or weaken a 'concept', e.g. honesty or positivity, providing a more powerful alternative to prompting or fine-tuning. Dialz supports a diverse set of tasks, including creating contrastive pair datasets, computing and applying steering vectors, and visualizations. Unlike existing libraries, Dialz emphasizes modularity and usability, enabling both rapid prototyping and in-depth analysis. We demonstrate how Dialz can be used to reduce harmful outputs such as stereotypes, while also providing insights into model behaviour across different layers. We release Dialz with full documentation, tutorials, and support for popular open-source models to encourage further research in safe and controllable…
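The abstract's core idea (derive a direction from contrastive pairs, then add it to activations at inference time) can be sketched in a few lines. This is a minimal illustration only, not the Dialz API: it uses the mean-difference method as one common way to compute a steering vector, and hand-written 3-dimensional lists in place of real hidden states from a model layer. The names mean_vector, steering_vector, and apply_steering are hypothetical helpers introduced for this example.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Mean-difference direction: mean(positive) minus mean(negative)."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden, vector, scalar):
    """Shift a hidden state along the steering direction.

    A positive scalar amplifies the concept, a negative one weakens it.
    """
    return [h + scalar * v for h, v in zip(hidden, vector)]


# Toy activations (assumed values) standing in for one layer's hidden states.
# Each inner list is the activation recorded for one prompt of a contrastive pair.
pos_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # e.g. "honest" prompts
neg_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]  # e.g. "dishonest" prompts

vector = steering_vector(pos_acts, neg_acts)
hidden = [0.5, 0.5, 0.5]
amplified = apply_steering(hidden, vector, scalar=2.0)
weakened = apply_steering(hidden, vector, scalar=-2.0)

print(vector)
print(amplified)
print(weakened)
```

In a real setting the activations would be read from a chosen transformer layer for each prompt in the contrastive dataset, and the shift would be applied to that layer's output during generation. The sign and size of the scalar control whether the concept is amplified or weakened, and by how much.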
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
