Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye

TL;DR
This paper introduces Sparse Activation Control, a training-free method to enhance LLM trustworthiness by selectively controlling attention heads, enabling simultaneous alignment with human preferences on safety, factuality, and bias.
Contribution
It proposes a novel sparse activation control technique that identifies and manipulates specific attention heads for multi-dimensional trustworthiness in LLMs without additional training.
Findings
The controlled models align with human preferences on safety, factuality, and bias simultaneously.
Sparse attention heads enable near-independent control over different trustworthiness aspects.
The approach works effectively on open-source Llama models.
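The idea of near-independent control through sparse heads can be illustrated with a toy sketch. It is not the paper's implementation: the function name `steer_heads`, the head indices, the steering directions, and the strengths are all hypothetical and chosen only to show that disjoint sets of heads can be adjusted without touching one another.

```python
# Illustrative sketch only. Head indices, directions, and strengths are
# hypothetical and are not taken from the paper.

def steer_heads(head_outputs, controls):
    """Add a steering vector to selected attention heads only.

    head_outputs: list of per-head activation vectors (lists of floats)
    controls: dict mapping head index -> (direction vector, strength)
    Returns a new list; heads not listed in `controls` are copied unchanged.
    """
    steered = []
    for idx, vec in enumerate(head_outputs):
        if idx in controls:
            direction, strength = controls[idx]
            vec = [v + strength * d for v, d in zip(vec, direction)]
        else:
            vec = list(vec)
        steered.append(vec)
    return steered


# Four heads with 3-dimensional outputs (toy values).
head_outputs = [[0.0, 0.0, 0.0] for _ in range(4)]

# Hypothetical assignment: head 1 is steered for safety, head 3 for factuality.
safety_controls = {1: ([1.0, 0.0, 0.0], 2.0)}
factuality_controls = {3: ([0.0, 1.0, 0.0], 0.5)}

# The head sets are disjoint, so the two controls can be merged directly.
combined = {**safety_controls, **factuality_controls}
steered = steer_heads(head_outputs, combined)
```

In this toy setup, changing the strength for head 1 has no effect on head 3, which is the sense in which sparse, non-overlapping heads allow each aspect to be tuned separately. In a real model the directions would have to be estimated from data and applied inside the forward pass.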
Abstract
As the development and application of Large Language Models (LLMs) continue to advance rapidly, enhancing their trustworthiness and aligning them with human preferences has become a critical area of research. Traditional methods rely heavily on extensive data for Reinforcement Learning from Human Feedback (RLHF), but representation engineering offers a new, training-free approach. This technique leverages semantic features to control the representation of an LLM's intermediate hidden states, enabling the model to meet specific requirements such as increased honesty or heightened safety awareness. However, a significant challenge arises when attempting to fulfill multiple requirements simultaneously. It proves difficult to encode various semantic contents, like honesty and safety, into a single semantic feature, restricting its practicality. In this work, we address this issue through…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Security and Verification in Computing · Cloud Data Security Solutions
Methods: Softmax · Attention Is All You Need · ALIGN · LLaMA
