State Stream Transformer (SST): Emergent Metacognitive Behaviours Through Latent State Persistence
Thea Aviss

TL;DR
The paper introduces the State Stream Transformer (SST), an architecture that keeps a persistent latent state across autoregressive generation steps. Using the same frozen weights as the base model, it reports emergent metacognitive behaviours and zero-shot accuracies of 89.01% on GSM-8K and 91.04% on ARC Challenge.
Contribution
SST is a novel transformer architecture that adds a sliding window latent state (FFN) cache with weighted decay, enabling emergent reasoning behaviours without retraining the base model.
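The idea of a sliding window latent state with decay can be illustrated with a small sketch. The source says only that SST keeps a sliding window cache of FFN latent states with weighted decay; everything else here is an assumption made for illustration, including the class name `LatentStateCache`, the geometric weighting `decay ** age`, the normalisation, and the `strength` mixing parameter. It is not the paper's implementation.

```python
from collections import deque


class LatentStateCache:
    """Hypothetical sliding-window cache of latent (FFN) state vectors.

    Assumption: older states are down-weighted geometrically by decay ** age.
    The paper's actual weighting scheme is not given in this summary.
    """

    def __init__(self, window: int, decay: float):
        if window < 1:
            raise ValueError("window must be at least 1")
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        # deque(maxlen=...) drops the oldest state once the window is full.
        self.states = deque(maxlen=window)

    def push(self, state):
        """Store a copy of the latent state from the latest generation step."""
        self.states.append(list(state))

    def blended(self):
        """Return the decay-weighted average of the cached states."""
        if not self.states:
            raise ValueError("cache is empty")
        n = len(self.states)
        # The newest state (last in the deque) has age 0 and weight 1.
        weights = [self.decay ** (n - 1 - i) for i in range(n)]
        total = sum(weights)
        dim = len(self.states[0])
        return [
            sum(w * s[d] for w, s in zip(weights, self.states)) / total
            for d in range(dim)
        ]


def blend_with_current(current, cache, strength):
    """Mix the current activation with the cached latent state.

    Assumption: a simple linear interpolation controlled by `strength`
    (0 keeps only the current activation, 1 keeps only the cached state).
    """
    if not cache.states:
        return list(current)
    past = cache.blended()
    return [(1.0 - strength) * c + strength * p for c, p in zip(current, past)]


if __name__ == "__main__":
    cache = LatentStateCache(window=2, decay=0.5)
    for step_state in ([1.0, 0.0], [0.0, 1.0], [2.0, 2.0]):
        cache.push(step_state)
    print(cache.blended())
    print(blend_with_current([1.0, 1.0], cache, strength=0.25))
```

In this sketch the window bounds how far back the latent state reaches, while the decay controls how quickly older steps fade. Both are free parameters here, not values taken from the paper.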
Findings
Achieves 89.01% accuracy on GSM-8K in the zero-shot setting
Achieves 91.04% accuracy on ARC Challenge with zero-shot chain-of-thought (CoT) prompting
Demonstrates that latent state persistence leads to improved reasoning capabilities
Abstract
We introduce the State Stream Transformer (SST), a novel LLM architecture that reveals emergent reasoning behaviours and capabilities latent in pretrained weights through addressing a fundamental limitation in traditional transformer models: the lack of latent computational continuity across autoregressive generations in the state space. SST introduces a sliding window latent state (FFN) cache with weighted decay that maintains and evolves persistent latent processes throughout autoregressive generations. Through controlled experiments comparing base and SST architectures using the same frozen weights, we demonstrate that this architectural modification alone enables enhanced reasoning capabilities which appear best explained by some form of potential higher-order processing, as evidenced by emergent metacognitive behaviours. These behaviours persist under controlled conditions designed…
Taxonomy
Topics: Neural Networks and Applications · Fault Detection and Control Systems
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Balanced Selection · Adam
