Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models

Tejas Vaidhya; Ayush Kaushal; Vineet Jain; Francis Couture Harpin; Prashant Shishodia; Majid Behbahani; Yuriy Nevmyvaka; Irina Rish

arXiv:2506.23025·cs.LG·July 1, 2025

Open Access

TL;DR

This paper introduces Spectra-1.1, a suite of ternary language models (TriLMs) trained on up to 1.2 trillion tokens, and proposes new weight-packing schemes and a GPU kernel that speed up inference by easing the memory bottleneck in large language models.

Contribution

The paper presents Spectra-1.1, a large-scale suite of TriLMs, and introduces 2-bit and 1.6-bit packing schemes for ternary weights, along with the TriRun GPU kernel, for faster inference.
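To make the two bit widths concrete, here is a minimal sketch of how ternary weights can be packed at 2 bits and at 1.6 bits per weight. It is an illustration under stated assumptions, not the paper's actual layout: the function names, the little-endian digit order, and the mapping of -1, 0, 1 to 0, 1, 2 are all hypothetical. The 1.6-bit figure comes from storing five ternary digits in one byte, which works because 3^5 = 243 fits in 256.

```python
def pack_2bit(trits):
    """Pack ternary weights four per byte, 2 bits each (hypothetical layout)."""
    out = bytearray()
    for i in range(0, len(trits), 4):
        byte = 0
        for j, t in enumerate(trits[i:i + 4]):
            byte |= (t + 1) << (2 * j)  # map -1, 0, 1 to 0, 1, 2
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, n):
    """Recover the first n ternary weights from 2-bit packed bytes."""
    return [((data[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(n)]


def pack_base3(trits):
    """Pack ternary weights five per byte as a base-3 number (1.6 bits each)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        # Process in reverse so the first weight ends up as the lowest digit.
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)
        out.append(byte)  # at most 242, so it always fits in one byte
    return bytes(out)


def unpack_base3(data, n):
    """Recover the first n ternary weights from base-3 packed bytes."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]  # drop padding digits from a partial final byte


weights = [-1, 0, 1, 1, 0, -1, -1, 1, 0, 0] * 2  # 20 ternary weights
packed_2bit = pack_2bit(weights)
packed_base3 = pack_base3(weights)
```

With 20 weights, the 2-bit form occupies 5 bytes and the base-3 form occupies 4 bytes. The trade-off in this sketch is decode cost: the 2-bit form unpacks with shifts and masks, while the base-3 form needs division and modulo (or a lookup table) to recover each weight.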

Findings

1. TriLMs benefit more from increased data than model scaling.
2. Proposed packing schemes accelerate CPU inference.
3. TriRun GPU kernel achieves up to 5x speedup.

Abstract

Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. As the computational power of modern GPU architectures continuously improves, their memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency,…

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Natural Language Processing Techniques