State Stream Transformer (SST): Emergent Metacognitive Behaviours Through Latent State Persistence

Thea Aviss

arXiv:2501.18356·cs.LG·January 31, 2025

Open Access

TL;DR

The paper introduces the State Stream Transformer (SST), an architecture that maintains persistent latent states across autoregressive generation steps. Using the same frozen weights as the base model, it reports emergent metacognitive behaviours and accuracy of 89.01% on GSM-8K (zero-shot) and 91.04% on ARC Challenge (zero-shot CoT).

Contribution

SST is a novel transformer architecture that adds a sliding-window latent state (FFN) cache with weighted decay, enabling emergent reasoning behaviours without retraining the base model.

Findings

01 Achieves 89.01% zero-shot accuracy on GSM-8K

02 Achieves 91.04% zero-shot CoT accuracy on ARC Challenge

03 Demonstrates that latent state persistence leads to improved reasoning capabilities

Abstract

We introduce the State Stream Transformer (SST), a novel LLM architecture that reveals emergent reasoning behaviours and capabilities latent in pretrained weights through addressing a fundamental limitation in traditional transformer models: the lack of latent computational continuity across autoregressive generations in the state space. SST introduces a sliding window latent state (FFN) cache with weighted decay that maintains and evolves persistent latent processes throughout autoregressive generations. Through controlled experiments comparing base and SST architectures using the same frozen weights, we demonstrate that this architectural modification alone enables enhanced reasoning capabilities which appear best explained by some form of potential higher-order processing, as evidenced by emergent metacognitive behaviours. These behaviours persist under controlled conditions designed…
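
As a rough illustration of what a sliding window latent state cache with weighted decay could look like, the sketch below keeps the most recent state vectors and blends them so that older entries count for less. The abstract as shown here does not give the update rule, so everything in the block is an assumption made for illustration: the names DecayedStateCache, mix, window, decay, and strength are hypothetical, and the exponential weighting and linear blend are stand-ins, not the paper's method.

```python
from collections import deque


class DecayedStateCache:
    """Hypothetical sliding-window cache of latent (FFN) state vectors.

    Assumption: older states are down-weighted exponentially. The paper's
    actual weighting scheme is not specified in the abstract.
    """

    def __init__(self, window: int, decay: float):
        if window < 1:
            raise ValueError("window must be at least 1")
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        # A deque with maxlen drops the oldest state once the window is full.
        self.states = deque(maxlen=window)

    def push(self, state):
        """Store a copy of the latest latent state vector."""
        self.states.append(list(state))

    def blended(self):
        """Weighted average of cached states.

        The newest state has weight 1; each step further back is
        multiplied by `decay` once more.
        """
        if not self.states:
            raise ValueError("cache is empty")
        n = len(self.states)
        weights = [self.decay ** (n - 1 - i) for i in range(n)]
        total = sum(weights)
        dim = len(self.states[0])
        return [
            sum(w * s[j] for w, s in zip(weights, self.states)) / total
            for j in range(dim)
        ]


def mix(current, cache, strength):
    """Blend the current state with the decayed history (assumed rule).

    strength = 0 returns the current state unchanged; strength = 1
    returns only the cached history.
    """
    if not cache.states:
        return list(current)
    past = cache.blended()
    return [(1.0 - strength) * c + strength * p for c, p in zip(current, past)]


if __name__ == "__main__":
    cache = DecayedStateCache(window=2, decay=0.5)
    for step_state in ([1.0, 0.0], [0.0, 1.0], [3.0, 3.0]):
        cache.push(step_state)
    # Only the two most recent states remain in the window.
    print(cache.blended())
    print(mix([1.0, 1.0], cache, strength=0.5))
```

In this sketch the window size bounds how far back the persisted state reaches, while decay and strength control how strongly earlier steps influence the current one. In a real model the vectors would be per-layer FFN activations, not short Python lists.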

Taxonomy

Topics: Neural Networks and Applications · Fault Detection and Control Systems

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Balanced Selection · Adam