TL;DR
TernaryLM is a memory-efficient language model trained natively with ternary quantization (about 1.58 bits per weight) and adaptive per-layer scaling, reporting competitive performance at a significantly reduced memory footprint.
Contribution
TernaryLM is presented as the first language model trained natively with ternary quantization from scratch, rather than quantized after full-precision pre-training, achieving memory savings with competitive performance.
Findings
Achieves validation perplexity of 58.42 on TinyStories.
Surpasses DistilBERT in downstream transfer with 82.47% F1 on MRPC.
Reduces memory usage by 2.4x without latency increase.
Abstract
Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource-constrained environments. We present TernaryLM, a 132M-parameter transformer trained natively with ternary quantization {-1, 0, +1} (log2(3) ~ 1.58-bit effective precision), achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches that quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories with a cross-seed standard deviation of +/- 0.17 PPL, confirming stable optimization; (2) strong downstream transfer with 82.47% F1 on MRPC, surpassing DistilBERT…
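The abstract's core mechanism (ternary weights, one scaling factor per layer, and a straight-through estimator for gradients) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the absmean scaling rule, the function names (`ternarize`, `forward`, `ste_weight_grad`), and the layer shapes are all assumptions made for illustration, and the paper's adaptive scaling may be computed differently.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Map a full-precision weight matrix to {-1, 0, +1} plus one scale.

    Assumption: the per-layer scale is the mean absolute weight (absmean).
    The paper's adaptive per-layer scaling may use a different rule.
    """
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

def forward(x, w):
    """Linear layer forward pass using the ternarized weights.

    x has shape (batch, in_features); w has shape (out_features, in_features).
    """
    q, scale = ternarize(w)
    return x @ (q * scale).T

def ste_weight_grad(x, grad_out):
    """Gradient for the latent full-precision weights under a
    straight-through estimator: the quantizer is treated as the identity
    in the backward pass, so the gradient with respect to the quantized
    weights is passed unchanged to the latent weights.
    """
    return grad_out.T @ x

# Hypothetical toy layer: 2 output features, 3 input features.
w = np.array([[0.9, -0.05, -1.2],
              [0.1, 0.4, -0.7]])
x = np.ones((4, 3))
q, scale = ternarize(w)
y = forward(x, w)
grad_w = ste_weight_grad(x, np.ones_like(y))
```

In a training loop under these assumptions, the optimizer would update the latent full-precision `w` with `grad_w`, while only `q` and `scale` would need to be stored for inference.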
