RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models
David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He

TL;DR
This paper benchmarks quantization aware training techniques for 4-bit seq2seq models across speech and translation tasks, identifies limitations of noise-based QAT, and proposes simple improvements to enhance accuracy and generalization.
Contribution
It provides a comprehensive benchmark of QAT techniques on seq2seq models and introduces low complexity modifications to improve noise-based QAT performance.
Findings
Noise-based QAT struggles with insufficient regularization.
Simple modifications outperform popular methods like learnable scale and clipping.
Enhanced QAT enables mixed precision training and better long-form speech recognition.
Abstract
With the rapid increase in the size of neural networks, model compression has become an important area of research. Quantization is an effective technique for decreasing the model size, memory access, and compute load of large models. Despite recent advances in quantization aware training (QAT) techniques, most papers present evaluations that are focused on computer vision tasks, which have different training dynamics compared to sequence tasks. In this paper, we first benchmark the impact of popular techniques such as straight through estimator, pseudo-quantization noise, learnable scale parameter, clipping, etc. on 4-bit seq2seq models across a suite of speech recognition datasets ranging from 1,000 hours to 1 million hours, as well as one machine translation dataset to illustrate its applicability outside of speech. Through the experiments, we report that noise-based QAT suffers when…
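To make two of the techniques named in the abstract concrete, here is a minimal NumPy sketch of the forward pass of rounding-based fake quantization (the path that a straight through estimator backpropagates through as if it were the identity) next to pseudo-quantization noise. This is not the paper's implementation: the symmetric per-tensor max-abs scale and the function names `quant_scale`, `fake_quant_round`, and `fake_quant_noise` are assumptions chosen for illustration.

```python
import numpy as np

def quant_scale(w, bits=4):
    # Assumed scale rule: symmetric, per-tensor, set by the largest |w|.
    # Assumes w is not all zeros.
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed integers
    return np.max(np.abs(w)) / qmax

def fake_quant_round(w, bits=4):
    """Rounding-based fake quantization (forward pass only).

    In STE-style QAT the backward pass would treat this function as the
    identity; NumPy has no autograd, so only the forward pass is shown.
    """
    qmax = 2 ** (bits - 1) - 1
    s = quant_scale(w, bits)
    q = np.clip(np.round(w / s), -qmax - 1, qmax)  # integer grid
    return q * s                                    # back to float

def fake_quant_noise(w, bits=4, rng=None):
    """Pseudo-quantization noise: replace rounding with additive
    uniform noise spanning one quantization step."""
    rng = np.random.default_rng() if rng is None else rng
    s = quant_scale(w, bits)
    noise = rng.uniform(-0.5, 0.5, size=w.shape)
    return w + s * noise

rng = np.random.default_rng(0)
w = rng.normal(size=1000)            # stand-in for a weight tensor
w_ste = fake_quant_round(w)          # values snapped to a 4-bit grid
w_pqn = fake_quant_noise(w, rng=rng) # values perturbed, not snapped
```

Both paths perturb each weight by at most half a quantization step, but only the rounding path produces values on the 4-bit grid; the noise path stays continuous and differentiable, which is the usual motivation for noise-based QAT.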
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper considers an important problem of quantization of neural network weight parameters, from the perspective of outlier robustness (i.e., preventing quantizer unsaturation). The proposed quantization algorithm has tunable knobs (such as the hyper-parameter $p$), which allow the user the freedom to control the extent of robustness desired. What is quite commendable is the extensive numerical evaluations on sequence-to-sequence models, on different tasks and different neural network archite
I have a few concerns with this work, and I would highly appreciate it if the authors could elaborate upon those. I would be more than happy to rectify my review and/or increase my score post the rebuttal period. 1. Is there anything specific in this paper that is concerned with sequence-to-sequence models? It seems that the general quantization aware training strategy can be extended to even other models. Please correct me if I am mistaken. And if there is anything specific with the proposed r
1. The proposed method shows superior performance to the evaluated baselines on the speech recognition task. 2. The proposed approach is simple and intuitive. 3. The authors conduct experiments on both small-scale and large-scale datasets.
1. It seems the proposed method is a special case of previously proposed methods, which tackle the more general case. 2. The comparison to prior work could be more thorough. The authors could compare to prior work on other tasks and on large-scale models. Similarly, the authors could provide results for more standard benchmarks in the literature, which would make it easier to compare to prior work.
* The paper formalizes the relationship between quantization ranges and the $L_p$ norm, which is very interesting for weight quantization. * The paper analyzes different ways of setting p and c to achieve different quantization ranges. * Experiments have been done on different datasets and different sizes of models. In particular, some tasks use large datasets, which take considerable effort to train on.
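One way to see why an $L_p$ norm can control a quantization range is that the size-normalized $L_p$ norm (a power mean of the absolute values) moves from an RMS-like average toward the maximum absolute value as p grows. The sketch below only illustrates that mathematical fact. The function name `normalized_lp_norm` and the normalization by the element count are assumptions, not the paper's formula, and the constant c mentioned in the review is not modeled.

```python
import numpy as np

def normalized_lp_norm(w, p):
    """Power mean of |w|: (mean(|w|**p)) ** (1/p).

    Hypothetical helper for illustration. Dividing by the max first
    keeps large p from overflowing or underflowing to zero.
    """
    a = np.abs(np.asarray(w, dtype=np.float64))
    m = a.max()
    if m == 0.0:
        return 0.0
    return float(m * np.mean((a / m) ** p) ** (1.0 / p))

# A small tensor with one outlier (-3.0).
w = np.array([0.1, -0.2, 0.05, 0.15, -3.0])

for p in (2, 8, 32, 128):
    # Small p: the outlier is averaged down; large p: approaches max|w|.
    print(p, normalized_lp_norm(w, p))
```

Under this reading, a smaller p yields a tighter range that is less sensitive to outliers (at the cost of clipping them), while a larger p tracks the maximum and avoids clipping.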
* Mode 1 looks like PQN with uniform noise and no clipping method. I feel this is like the method Relaxed quantization [1], which also views the quantization as variational noise and includes stochastic rounding as a special case. Meanwhile, I have some confusion about Mode 1 in experiment settings. What is the difference between Mode 1 + 4-bit STE and None + 4-bit STE? I expect None means no clipping method. Also, if Mode 1 + 4-bit STE means STE without any clipping, there is nothing new and is
Taxonomy
Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques · Domain Adaptation and Few-Shot Learning
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence · Attentive Walk-Aggregating Graph Neural Network
