Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

Yuxin Xiao; Chaoqun Wan; Yonggang Zhang; Wenxiao Wang; Binbin Lin; Xiaofei He; Xu Shen; Jieping Ye

arXiv:2411.02461·cs.CL·November 6, 2024

TL;DR

This paper introduces Sparse Activation Control, a training-free method to enhance LLM trustworthiness by selectively controlling attention heads, enabling simultaneous alignment with human preferences on safety, factuality, and bias.

Contribution

It proposes a novel sparse activation control technique that identifies and manipulates specific attention heads for multi-dimensional trustworthiness in LLMs without additional training.

Findings

01

The method aligns models with human preferences on safety, factuality, and bias simultaneously.

02

Sparse attention heads enable near-independent control over different trustworthiness aspects.

03

The approach works effectively on open-source Llama models.
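
To make the idea of sparse, per-head control concrete, here is a minimal toy sketch. It is not the paper's implementation: this page does not describe how heads are identified or how the control signal is computed, so the function `steer_heads`, the choice of head indices, the steering vectors, and the `strength` values are all hypothetical. The sketch only shows why touching disjoint sets of heads lets two adjustments (labeled "safety" and "factuality" here) be applied without interfering with each other.

```python
import math

def steer_heads(head_outputs, directions, strength=1.0):
    """Add a unit-normalized steering vector to selected heads only.

    head_outputs: list of per-head output vectors (num_heads x head_dim).
    directions:   dict mapping head index -> steering vector (hypothetical).
    Heads not listed in `directions` are returned unchanged.
    """
    steered = [list(row) for row in head_outputs]  # copy; input is not mutated
    for head, vec in directions.items():
        norm = math.sqrt(sum(v * v for v in vec))
        for i, v in enumerate(vec):
            steered[head][i] += strength * v / norm
    return steered

# Toy activations for 8 heads with 4 dimensions each (made-up numbers).
heads = [[0.1 * (h + d) for d in range(4)] for h in range(8)]

# Assumption: each behavior is tied to a different, small set of heads.
safety_dirs = {1: [1.0, 0.0, 0.0, 0.0]}
factuality_dirs = {5: [0.0, 2.0, 0.0, 0.0]}

out = steer_heads(heads, safety_dirs, strength=2.0)
out = steer_heads(out, factuality_dirs, strength=0.5)

# Only the two targeted heads differ from the original activations.
changed = [h for h in range(len(heads)) if out[h] != heads[h]]
print(changed)
```

Because the two dictionaries name different heads, applying them in either order gives the same result in this toy setting; in a real model the degree of independence would have to be measured rather than assumed.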

Abstract

As the development and application of Large Language Models (LLMs) continue to advance rapidly, enhancing their trustworthiness and aligning them with human preferences has become a critical area of research. Traditional methods rely heavily on extensive data for Reinforcement Learning from Human Feedback (RLHF), but representation engineering offers a new, training-free approach. This technique leverages semantic features to control the representation of an LLM's intermediate hidden states, enabling the model to meet specific requirements such as increased honesty or heightened safety awareness. However, a significant challenge arises when attempting to fulfill multiple requirements simultaneously. It proves difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature, restricting its practicality. In this work, we address this issue through…

Videos

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control · SlidesLive

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Security and Verification in Computing · Cloud Data Security Solutions

Methods: Softmax · Attention Is All You Need · ALIGN · LLaMA