SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Shourya Batra; Pierce Tillman; Samarth Gaggar; Shashank Kesineni; Kevin Zhu; Sunishchal Dev; Ashwinee Panda; Vasu Sharma; Maheep Chaudhary

arXiv:2511.07772 · cs.CR · November 24, 2025

TL;DR

SALT is a lightweight test-time intervention that reduces privacy leakage in large language models' reasoning traces by injecting targeted steering vectors, balancing privacy protection with reasoning utility.

Contribution

We introduce SALT, a method that mitigates internal privacy leakage in LLMs during reasoning, without sacrificing task performance, by identifying high-leakage layers and steering their activations.
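The page does not say how high-leakage layers are identified. The following is a minimal sketch under stated assumptions: each layer's activations are pooled into one vector per reasoning trace, traces are labeled leaky or non-leaky, and layers are ranked by the distance between the two class means. The scoring rule, the helper names (`layer_leakage_score`, `select_high_leakage_layers`), the `top_k` cutoff and the toy numbers are all hypothetical, not SALT's published procedure.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def layer_leakage_score(leaky: Sequence[Vector], safe: Sequence[Vector]) -> float:
    """Hypothetical score: Euclidean distance between the mean activation of
    leaky and non-leaky traces at a single layer."""
    mu_leaky = mean_vector(leaky)
    mu_safe = mean_vector(safe)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_leaky, mu_safe)))


def select_high_leakage_layers(
    leaky_acts: Dict[int, List[Vector]],
    safe_acts: Dict[int, List[Vector]],
    top_k: int = 2,
) -> List[int]:
    """Rank layers by the hypothetical score and return the top_k layer indices."""
    scores = {
        layer: layer_leakage_score(leaky_acts[layer], safe_acts[layer])
        for layer in leaky_acts
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy example: 3 layers, 2-dimensional pooled activations (made-up numbers).
leaky = {0: [[0.1, 0.0], [0.0, 0.1]], 1: [[2.0, 0.0], [2.2, 0.0]], 2: [[0.5, 0.5], [0.7, 0.5]]}
safe = {0: [[0.0, 0.0], [0.1, 0.1]], 1: [[0.0, 0.0], [0.2, 0.0]], 2: [[0.0, 0.5], [0.2, 0.5]]}
print(select_high_leakage_layers(leaky, safe, top_k=2))  # [1, 2]
```

In this toy data, layer 1 separates the two classes most (score 2.0), layer 2 less so (0.5), and layer 0 not at all, so layers 1 and 2 would be chosen for steering.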

Findings

01

Achieves up to a 31.2% reduction in privacy leakage across datasets.

02

Maintains comparable task performance and utility.

03

Effective across multiple large language models.

Abstract

As Large Language Models (LLMs) evolve into personal assistants with access to sensitive user data, they face a critical privacy challenge: while prior work has addressed output-level privacy, recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. These leaky thoughts occur when models inadvertently expose sensitive details in their reasoning traces, even when their final outputs appear safe. The challenge lies in preventing such leakage without compromising the model's reasoning capabilities, which requires a delicate balance between privacy and utility. We introduce Steering Activations towards Leakage-free Thinking (SALT), a lightweight test-time intervention that mitigates privacy leakage in the model's Chain of Thought (CoT) by injecting targeted steering vectors into its hidden states. We identify the…
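The abstract describes injecting a steering vector into hidden states at test time but, in the portion shown, not how that vector is built. The following is a minimal sketch assuming a common difference-of-means construction: the vector points from the mean activation of leaky traces toward the mean of non-leaky traces, and it is added to a hidden state with a scale `alpha`. The helper names, `alpha`, and the toy numbers are hypothetical, not SALT's exact formulation.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def difference_of_means_vector(safe: Sequence[Vector], leaky: Sequence[Vector]) -> Vector:
    """Assumed construction: mean(non-leaky) - mean(leaky) at one layer,
    i.e. a direction pointing away from leaky activations."""
    mu_safe = mean_vector(safe)
    mu_leaky = mean_vector(leaky)
    return [s - l for s, l in zip(mu_safe, mu_leaky)]


def steer_hidden_state(hidden: Vector, steering: Vector, alpha: float = 1.0) -> Vector:
    """Test-time intervention on one hidden state: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, steering)]


# Toy example with made-up 2-dimensional activations at one selected layer.
safe_traces = [[0.0, 1.0], [0.0, 3.0]]
leaky_traces = [[2.0, 1.0], [4.0, 1.0]]
v = difference_of_means_vector(safe_traces, leaky_traces)
print(v)  # [-3.0, 1.0]
print(steer_hidden_state([3.0, 1.0], v, alpha=0.5))  # [1.5, 1.5]
```

In a real transformer, an addition like this would typically be applied to the output of the selected layers during generation, for example through a PyTorch forward hook (`torch.nn.Module.register_forward_hook`). The scale `alpha` is presumably where the privacy/utility balance is tuned: too small leaves leakage in place, too large may disrupt reasoning.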

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI