TL;DR
The paper introduces the Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers that improves stability and performance by partitioning the hidden vector into channels and injecting a fixed position profile.
Contribution
3PT presents a new architecture with phase-respecting operations, channel partitioning, and position injection, demonstrating improved perplexity and convergence speed on language modeling tasks.
Findings
Reports a 7.20% reduction in perplexity on WikiText-103 at 123M parameters.
Provides a self-stabilizing equilibrium architecture without explicit constraints.
Identifies N=3 as an effective parameter-sharing choice and reports stability across different values of N.
Abstract
We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where…
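The two formulas stated in the abstract, the per-channel rotation angle theta + i*(2*pi/N) and the position profile r(p) = 1/(p+1), can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names are hypothetical, and two details are assumptions the abstract does not confirm: channels are taken as contiguous slices of the hidden vector, and the 2D Givens rotation is applied to adjacent coordinate pairs within each channel. How r(p) is injected into the DC subspace is not specified in the text above, so the sketch only evaluates the profile.

```python
import math


def phase_angles(theta, n_channels):
    """Rotation angle for channel i: theta + i * (2*pi / N), as in the abstract."""
    return [theta + i * (2.0 * math.pi / n_channels) for i in range(n_channels)]


def split_channels(hidden, n_channels):
    """Split a hidden vector into N equally sized channels.

    Assumption: channels are contiguous slices; the abstract does not say
    how coordinates are assigned to channels.
    """
    if len(hidden) % n_channels != 0:
        raise ValueError("hidden size must be divisible by the number of channels")
    width = len(hidden) // n_channels
    return [hidden[i * width:(i + 1) * width] for i in range(n_channels)]


def givens_rotate(channel, angle):
    """Rotate each adjacent coordinate pair (x, y) of one channel by `angle`.

    Assumption: the pairing of adjacent coordinates is illustrative only.
    """
    if len(channel) % 2 != 0:
        raise ValueError("channel width must be even to form 2D pairs")
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for k in range(0, len(channel), 2):
        x, y = channel[k], channel[k + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def rotate_hidden(hidden, theta, n_channels=3):
    """Apply the phase-shifted rotation to every channel and re-concatenate."""
    rotated = []
    for channel, angle in zip(split_channels(hidden, n_channels),
                              phase_angles(theta, n_channels)):
        rotated.extend(givens_rotate(channel, angle))
    return rotated


def horn_profile(position):
    """Fixed position profile r(p) = 1 / (p + 1) for a 0-based position p."""
    return 1.0 / (position + 1)


if __name__ == "__main__":
    hidden = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # three channels of width 2
    print(rotate_hidden(hidden, theta=0.0, n_channels=3))
    print([horn_profile(p) for p in range(4)])
```

Because each Givens rotation is orthogonal, the rotation leaves the norm of the hidden vector unchanged. With N=3 and theta=0, identical inputs in the three channels are rotated to angles 0, 2*pi/3, and 4*pi/3, so their components sum to zero, which mirrors the balanced three-phase picture the abstract refers to.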