Balancing Stylization and Truth via Disentangled Representation Steering
Chenglei Shen, Zhongxiang Sun, Teng Shi, Xiao Zhang, Jun Xu

TL;DR
This paper introduces StyliTruth, a method that disentangles style and truth in language models to enhance stylization control without compromising factual accuracy.
Contribution
It proposes a novel disentangled representation steering mechanism that separates style and truth subspaces, reducing their interference during generation.
Findings
Substantially reduces stylization-induced truthfulness collapse.
Outperforms existing methods in balancing style and factual accuracy.
Validated across multiple styles and languages.
Abstract
Generating stylized large language model (LLM) responses via representation editing is a promising approach to fine-grained output control. However, there is an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model's core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model's representation space via an orthogonal deflation process. This decomposition…
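As a rough illustration of what separating a style direction from a truth subspace can look like, the sketch below uses generic linear algebra on synthetic data. It is not StyliTruth's published procedure, whose details are cut off in the abstract above. Assumptions: directions are estimated by difference of means between two sets of activations for a single attention head; "orthogonal deflation" is approximated by projecting the truth subspace out of the style direction; all function names (`mean_difference_direction`, `orthonormal_basis`, `deflate`, `coupling`, `steer`) and the strength `alpha` are hypothetical. Only NumPy is used.

```python
import numpy as np


def mean_difference_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Unit steering direction from two activation sets of shape (n_samples, d).

    Assumption: difference-of-means probing, a common choice in representation
    editing; not necessarily how StyliTruth extracts its directions.
    """
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    return direction / np.linalg.norm(direction)


def orthonormal_basis(vectors: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Orthonormal basis (k, d) spanning the rows of `vectors`, computed via SVD."""
    _, s, vt = np.linalg.svd(vectors, full_matrices=False)
    return vt[s > tol * s.max()]


def deflate(style: np.ndarray, truth_basis: np.ndarray) -> np.ndarray:
    """Remove every component of `style` that lies in the span of `truth_basis`.

    `truth_basis` rows must be orthonormal; the returned unit vector is
    orthogonal to each of them.
    """
    residual = style - truth_basis.T @ (truth_basis @ style)
    return residual / np.linalg.norm(residual)


def coupling(style: np.ndarray, truth_basis: np.ndarray) -> float:
    """Norm of a unit style vector's projection onto the truth subspace.

    0 means fully decoupled, 1 means the style vector lies inside the subspace.
    """
    return float(np.linalg.norm(truth_basis @ style))


def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift each row of `hidden` along `direction` with strength `alpha`."""
    return hidden + alpha * direction


# Synthetic demo for one hypothetical attention head of width d.
rng = np.random.default_rng(0)
d = 64
truth_vecs = rng.normal(size=(3, d))      # hypothetical truth directions
T = orthonormal_basis(truth_vecs)         # (3, d) orthonormal truth basis

base = rng.normal(size=(200, d))
entangled = rng.normal(size=d) + 2.0 * truth_vecs[0]  # style shift that leaks into truth
styled = base[:100] + entangled           # hypothetical "stylized" activations
plain = base[100:]                        # hypothetical "plain" activations

raw_style = mean_difference_direction(styled, plain)
clean_style = deflate(raw_style, T)

print("coupling before deflation:", coupling(raw_style, T))
print("coupling after deflation: ", coupling(clean_style, T))

alpha = 4.0
steered = steer(plain, clean_style, alpha)
```

Because the deflated direction is orthogonal to every truth direction, adding it to an activation leaves that activation's coordinates in the truth subspace unchanged (`steered @ T.T` equals `plain @ T.T` up to floating-point error), which is the intuition behind steering style without disturbing truthfulness. In a real model, `coupling` could be computed per attention head to find the heads where style and truth directions are entangled.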
