CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
Xintong Wang, Jingheng Pan, Liang Ding, Longyue Wang, Longqin Jiang, Xingshan Li, Chris Biemann

TL;DR
This paper introduces CogSteer, a cognition-inspired, layer-specific intervention method for steering large language models efficiently and interpretably, improving safety and task performance.
Contribution
It proposes a novel heuristic for selecting optimal intervention layers based on cognitive insights and introduces an implicit contrastive method for safer LLM deployment.
Findings
Effective in reducing toxic outputs during inference
Improves task performance with fewer parameters
Enhances interpretability of LLMs
Abstract
Large Language Models (LLMs) achieve remarkable performance through pretraining on extensive data. This enables efficient adaptation to diverse downstream tasks. However, the lack of interpretability in their underlying mechanisms limits the ability to effectively steer LLMs for specific applications. In this work, we investigate the intrinsic mechanisms of LLMs from a cognitive perspective using eye movement measures. Specifically, we analyze the layer-wise correlation between human cognitive indicators and LLM representations. Building on these insights, we propose a heuristic approach for selecting the optimal steering layer to modulate LLM semantics. To this end, we introduce an efficient selective layer intervention based on prominent parameter-efficient fine-tuning methods, which conventionally adjust either all layers or only the final layer. Additionally, we present an implicit…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques
