Towards smaller, faster decoder-only transformers: Architectural variants and their implications
Sathya Krishnan Suresh, Shunmugapriya P

TL;DR
This paper introduces three architectural variants of decoder-only transformers—ParallelGPT, LinearGPT, and ConvGPT—that achieve similar performance to traditional models while being smaller and faster to train, with open-source code provided.
Contribution
The study proposes three novel transformer architectures that reduce model size and training time without sacrificing language generation quality.
Findings
Comparable performance to standard transformers in language tasks
Reduced model sizes and faster training times
Open-source implementation available
Abstract
In recent times, research on Large Language Models (LLMs) has grown exponentially, predominantly focusing on models underpinned by the transformer architecture, as established by [1], and further developed through the decoder-only variations by [2]. Contemporary efforts in this field primarily aim to enhance model capabilities by scaling up both the architecture and the data volumes used during training. However, work on reducing these model sizes while preserving their efficacy remains scant. In this study, we introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt). These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes. We open-source the model weights and the complete…
Taxonomy
Topics: Embedded Systems Design Techniques · Advanced Data Storage Technologies · VLSI and FPGA Design Techniques
