Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift
Sheer Karny, Anthony Baez, and Pat Pataranutaporn

TL;DR
This paper introduces multi-turn neural transparency, a real-time visualization of neural activations in LLMs, to improve user understanding and calibration of model behavior across conversations.
Contribution
It presents a novel interface that visualizes neural trait expressions in real time, enhancing user ability to anticipate and evaluate model behavior.
Findings
Neural transparency significantly improved trait prediction accuracy (d = -0.34 to -0.49).
The multi-turn visualization outperformed a static single-turn visualization (d = -0.32).
Transparency reduced overconfidence among users.
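The d values above are reported as standardized effect sizes. As a rough reading aid, the sketch below computes Cohen's d with a pooled standard deviation, which is the most common definition. This summary does not state which outcome measure, formula variant, or subtraction order the authors used, so the function name, group labels, and toy numbers are all illustrative assumptions. A negative value simply means the first group's mean is lower than the second's, as it would be if the outcome were a prediction error that decreases under treatment.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation: (mean_a - mean_b) / s_pooled."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy numbers only (not data from the paper): hypothetical prediction errors.
treatment_errors = [1.0, 2.0, 3.0]
control_errors = [2.0, 3.0, 4.0]
d = cohens_d(treatment_errors, control_errors)
```

By the usual rule of thumb, magnitudes near 0.2 are read as small and near 0.5 as medium, which would place the reported range between the two.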
Abstract
Chatbot behavior is often opaque to users, as responses can shift unpredictably across a conversation, drifting toward sycophancy, toxicity, or other unsafe responses. This can leave users vulnerable, either being misled by overly agreeable AI or manipulated by a harmful chatbot that no longer behaves as intended. To address this, we introduce multi-turn neural transparency, an interface that surfaces an LLM's internal neural activations in real time to help users anticipate and recognize how behaviors change across turns. We construct behavioral vectors for six personality traits using methods from mechanistic interpretability, identifying directions in activation space that correlate with trait expression via contrastive system prompts, and visualize trait expression using a sunburst and drift panel that updates at each turn. In a randomized controlled study (N =…
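To make the idea of a behavioral vector concrete, here is a minimal sketch of one common way to derive a trait direction from contrastive prompts: take the difference between mean activations collected under a trait-eliciting system prompt and under a trait-suppressing one, then project each turn's activation onto that direction. This is an assumption about the general technique, not the authors' implementation. The function names, the two-dimensional toy activations, and the drift definition (change relative to the first turn) are all hypothetical.

```python
from math import sqrt


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def trait_direction(pos_acts, neg_acts):
    """Unit-norm difference of means: trait-eliciting minus trait-suppressing."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("contrastive activations have identical means")
    return [x / norm for x in diff]


def trait_score(activation, direction):
    """Projection (dot product) of one activation onto the trait direction."""
    return sum(a * d for a, d in zip(activation, direction))


def drift(scores):
    """Per-turn change in trait score relative to the first turn."""
    return [s - scores[0] for s in scores]


# Toy activations (assumed, 2-dimensional for readability).
pos_acts = [[2.0, 0.0], [4.0, 0.0]]  # collected under a trait-eliciting prompt
neg_acts = [[0.0, 0.0], [0.0, 0.0]]  # collected under a trait-suppressing prompt
direction = trait_direction(pos_acts, neg_acts)

# One activation per conversation turn.
turn_acts = [[0.5, 1.0], [1.5, -2.0], [3.0, 0.0]]
scores = [trait_score(a, direction) for a in turn_acts]
turn_drift = drift(scores)
```

In an interface like the one described, a per-turn score of this kind could feed a display that updates after every response, with one direction per trait.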
