Binary and Ternary Natural Language Generation
Zechun Liu, Barlas Oguz, Aasish Pappu, Yangyang Shi, Raghuraman Krishnamoorthi

TL;DR
This paper introduces the first ternary and binary transformer models for text summarization and translation, achieving competitive performance with significantly higher efficiency through novel quantization techniques.
Contribution
It presents a new approach combining statistics-based weight quantization and elastic activation quantization to enable effective training of ternary and binary transformers.
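To make the two ingredients concrete, here is a minimal forward-pass sketch in plain Python. It is an illustration under stated assumptions, not the paper's exact formulation. The weight quantizer uses the well-known Ternary Weight Networks heuristic (threshold at 0.7 times the mean absolute weight), swapped in as one example of a statistics-based rule. The activation quantizer takes its step size `alpha` as an ordinary argument, whereas an elastic scheme would learn it during training. The function names `ternarize` and `quantize_activation` are hypothetical.

```python
def ternarize(weights, threshold_factor=0.7):
    """Quantize real-valued weights to alpha * {-1, 0, +1}.

    Threshold and scale are derived from weight statistics
    (TWN-style heuristic, assumed here for illustration only).
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_factor * mean_abs  # weights with |w| <= delta map to 0
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0  # shared scale for +/-1 codes
    return alpha, codes


def quantize_activation(x, alpha, bits=8):
    """Uniform signed quantization with step size alpha (forward pass only).

    In an elastic scheme alpha would be a learned parameter; here it is
    simply passed in as a number.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    level = max(qmin, min(qmax, round(x / alpha)))  # round, then clip to range
    return alpha * level


if __name__ == "__main__":
    alpha, codes = ternarize([0.9, -0.8, 0.05, -0.1, 0.7, 0.02])
    print(codes)  # [1, -1, 0, 0, 1, 0]
    print(alpha, quantize_activation(0.26, alpha=0.1))
```

Small weights collapse to zero, and the remaining ones share a single scale, so each weight can be stored in about two bits plus one floating-point scale per tensor.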
Findings
Ternary BART base achieves an R1 score of 41 on CNN/DailyMail, close to the full-precision model.
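The binary model, while less accurate, achieves an R1 score of 35.6 on CNN/DailyMail.
On WMT16 En-Ro translation, the models reach BLEU scores of 21.7 and 17.6.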
With 8-bit activations, the ternary and binary weight models match or outperform the best existing 8-bit weight models in the literature.
Abstract
Ternary and binary neural networks enable multiplication-free computation and promise multiple orders of magnitude efficiency gains over full-precision networks if implemented on specialized hardware. However, since both the parameter and the output space are highly discretized, such networks have proven very difficult to optimize. The difficulties are compounded for the class of transformer text generation models due to the sensitivity of the attention operation to quantization and the noise-compounding effects of autoregressive decoding in the high-cardinality output space. We approach the problem with a mix of statistics-based quantization for the weights and elastic quantization of the activations and demonstrate the first ternary and binary transformer models on the downstream tasks of summarization and machine translation. Our ternary BART base achieves an R1 score of 41 on the…
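The "multiplication-free" claim can be illustrated with a short sketch. When weights are restricted to {-1, 0, +1}, a dot product reduces to additions and subtractions, with at most one multiplication by a shared scale at the end. The function name `ternary_dot` and the scale argument `alpha` are hypothetical and chosen only for this example. Real efficiency gains depend on specialized hardware or kernels, which this sketch does not model.

```python
def ternary_dot(codes, activations, alpha=1.0):
    """Dot product of ternary weight codes with activations.

    codes: values in {-1, 0, +1}. The inner loop only adds or subtracts.
    alpha: shared weight scale, applied once after accumulation.
    """
    acc = 0.0
    for c, x in zip(codes, activations):
        if c == 1:
            acc += x      # +1 weight: add the activation
        elif c == -1:
            acc -= x      # -1 weight: subtract the activation
        # 0 weight: skip entirely
    return alpha * acc


if __name__ == "__main__":
    print(ternary_dot([1, -1, 0, 1], [2.0, 3.0, 5.0, 4.0], alpha=0.5))  # 1.5
```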
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Dense Connections · Residual Connection
