JoFormer (Journey-based Transformer): Theory and Empirical Analysis on the Tiny Shakespeare Dataset
Mahesh Godavarti

TL;DR
JoFormer is a journey-based Transformer architecture that uses a non-commutative algebra for positional encoding. On the Tiny Shakespeare character-level language modeling task, it reaches lower perplexity and converges faster than the RoFormer baseline.
Contribution
The paper presents JoFormer, a journey-based Transformer architecture grounded in a non-commutative algebra for composing transformations across positions. It extends relative position representations and subsumes existing methods such as rotary transformations as special cases.
Findings
JoFormer achieves lower perplexity than RoFormer on Tiny Shakespeare.
JoFormer converges faster than RoFormer on the same character-level language modeling task.
The approach offers a more expressive, principled way to incorporate positional information.
Abstract
Transformers have demonstrated remarkable success in sequence modeling, yet effectively incorporating positional information remains a challenging and active area of research. In this paper, we introduce JoFormer, a journey-based Transformer architecture grounded in a recently proposed non-commutative algebra for composing transformations across positions. JoFormer represents relative positions through learnable directional transforms that are sequentially composed along the input, thereby extending and generalizing existing approaches based on relative position representations. We derive the JoFormer attention mechanism from first principles and show that it subsumes standard methods such as rotary transformations as special cases. To evaluate its effectiveness, we compare JoFormer to the RoFormer baseline on the Tiny Shakespeare character-level language modeling task. Our results…
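The idea of composing directional transforms along the input, and recovering rotary embeddings as a special case, can be illustrated with a small sketch. This is only one plausible reading of the abstract, not the paper's formulation: the attention score form `q_i^T (P_i^{-1} P_j) k_j`, the restriction to 2x2 rotations, and every function name below are assumptions made for illustration.

```python
import math

# Illustrative sketch only: names and the score formula are assumptions,
# not taken from the JoFormer paper.

def rotation(theta):
    """Return a 2x2 rotation matrix as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def apply(m, v):
    """Apply a 2x2 matrix to a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def journey_prefixes(step_transforms):
    """P[0] = I and P[i] = T_1 T_2 ... T_i, composed in sequence order."""
    prefixes = [[[1.0, 0.0], [0.0, 1.0]]]
    for t in step_transforms:
        prefixes.append(matmul(prefixes[-1], t))
    return prefixes

def journey_score(q, k, i, j, prefixes):
    """Score q^T (P_i^{-1} P_j) k. The steps here are orthogonal,
    so the inverse of P_i is its transpose."""
    rel = matmul(transpose(prefixes[i]), prefixes[j])
    return dot(q, apply(rel, k))

def rope_score(q, k, i, j, theta):
    """Rotary-style score: rotate q by i*theta and k by j*theta, then dot."""
    return dot(apply(rotation(i * theta), q), apply(rotation(j * theta), k))

q, k = [1.0, 2.0], [0.5, -1.0]

# Special case: every step is the same rotation, which matches the rotary score.
theta = 0.3
shared = journey_prefixes([rotation(theta)] * 5)
s_journey = journey_score(q, k, 1, 4, shared)
s_rope = rope_score(q, k, 1, 4, theta)

# General case: each step has its own angle (a stand-in for a learned transform).
angles = [0.1, 0.4, -0.2, 0.7, 0.05]
varied = journey_prefixes([rotation(a) for a in angles])
s_varied = journey_score(q, k, 1, 4, varied)
```

With a single shared angle, `s_journey` and `s_rope` agree up to floating-point error, which is the sense in which a rotary transformation is a special case here. Rotations in two dimensions commute, so this toy example does not exercise non-commutativity; larger, non-commuting step matrices would be needed for that, and `journey_score` would then require a true inverse rather than a transpose unless the steps stay orthogonal.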
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · 3D Shape Modeling and Analysis
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Attention Is All You Need
