Stream separation improves Bregman conditioning in transformers
James Clayton Kerce

TL;DR
This paper investigates how stream separation improves the conditioning of the Bregman geometry in transformer layers, which makes linear interventions more reliable by accounting for curvature in representation space.
Contribution
It introduces a controlled experimental framework to measure intermediate layer geometry and demonstrates that stream separation significantly improves Bregman metric conditioning in transformers.
Findings
Stream separation increases effective rank by up to 22.
Per-layer supervision has a smaller impact on conditioning.
Cosine similarity predicts steering effectiveness with a threshold near 0.3.
Abstract
Linear methods for steering transformer representations, including probing, activation engineering, and concept erasure, implicitly assume the geometry of representation space is Euclidean. Park et al. [2026] showed that softmax induces a curved Bregman geometry whose metric tensor is the Hessian of the log-normalizer, H. Ignoring this curvature causes Euclidean steering to leak probability mass to unintended tokens. Their analysis applies at the output layer. We measure this Hessian at intermediate layers in a controlled 2x2 design crossing stream separation with per-layer supervision (vocabulary decoding loss at each layer), all at matched vocabulary and parameter count. In standard single-stream transformers, H is severely degenerate at intermediate layers (effective rank 8 in 516 dimensions). Stream separation improves…
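As a rough illustration of the quantity the abstract describes, the sketch below builds the Hessian of the softmax log-normalizer with respect to the logits, diag(p) - p pᵀ, pulls it back to a hidden space through a linear unembedding, and scores its degeneracy. Several pieces are assumptions rather than the paper's method: the linear readout `W`, the hidden vector `h`, the toy sizes, and the use of the participation ratio (tr H)² / tr(H²) as the effective-rank measure, since the excerpt does not say which definition the authors use.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_hessian(p):
    """Hessian of log-sum-exp with respect to the logits: diag(p) - p p^T."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]


def pullback(W, G):
    """Return W^T G W, where W has shape (vocab, d) and G is (vocab, vocab).

    Assumes logits = W h, i.e. a purely linear readout (hypothetical here).
    """
    V, d = len(W), len(W[0])
    GW = [[sum(G[i][k] * W[k][b] for k in range(V)) for b in range(d)]
          for i in range(V)]
    return [[sum(W[k][a] * GW[k][b] for k in range(V)) for b in range(d)]
            for a in range(d)]


def participation_ratio(H):
    """Effective rank as (tr H)^2 / tr(H^2) for a symmetric PSD matrix H.

    For symmetric H, tr(H^2) equals the squared Frobenius norm, so no
    eigendecomposition is needed. This is one common definition, chosen
    here as an assumption.
    """
    trace = sum(H[i][i] for i in range(len(H)))
    frob_sq = sum(x * x for row in H for x in row)
    return trace * trace / frob_sq if frob_sq > 0 else 0.0


# Toy example: vocabulary of 6 tokens, hidden size 3 (illustrative sizes only).
random.seed(0)
V, d = 6, 3
W = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(V)]
h = [random.gauss(0.0, 1.0) for _ in range(d)]
logits = [sum(W[v][a] * h[a] for a in range(d)) for v in range(V)]

H_hidden = pullback(W, logit_hessian(softmax(logits)))
print("effective rank (participation ratio):", participation_ratio(H_hidden))
```

A value close to the hidden dimension indicates a well-conditioned metric, while a value far below it indicates the kind of degeneracy the abstract reports for single-stream models. Other effective-rank definitions, such as the exponential of the spectral entropy, would give different numbers on the same matrix.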
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis
