Pieces of Eight: 8-bit Neural Machine Translation
Jerry Quinn, Miguel Ballesteros

TL;DR
This paper shows that translating with 8-bit quantization, using neural machine translation models trained in 32-bit floating point, speeds up translation with no degradation in accuracy or adequacy. This benefits latency-sensitive applications and helps control cloud hosting costs.
Contribution
The study shows that 8-bit quantization can be applied at translation time to NMT models already trained with 32-bit floating point values, improving speed with no loss in translation quality.
Findings
8-bit quantization increases translation speed.
No degradation in translation quality with 8-bit models.
Applicable to latency-sensitive applications and to deployments where cloud hosting cost matters.
Abstract
Neural machine translation has achieved levels of fluency and adequacy that would have been surprising a short time ago. Output quality is extremely relevant for industry purposes; however, it is equally important to produce results in the shortest time possible, mainly for latency-sensitive applications and to control cloud hosting costs. In this paper we show the effectiveness of translating with 8-bit quantization for models that have been trained using 32-bit floating point values. Results show that 8-bit translation has a non-negligible impact on speed with no degradation in accuracy and adequacy.
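To make the idea of 8-bit translation concrete, here is a minimal sketch of symmetric linear quantization applied to a single dot product, the core operation inside the matrix multiplications of an NMT model. The summary above does not describe the authors' exact quantization scheme, so the symmetric per-vector scaling and the function names (`quantize`, `quantized_dot`) are illustrative assumptions, not the paper's method.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with a single scale (assumed symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantized_dot(a, b):
    """Dot product computed on 8-bit integers, then rescaled back to float."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))    # integer accumulation
    return acc * scale_a * scale_b


# Hypothetical values standing in for trained 32-bit weights and activations.
weights = [0.42, -1.27, 0.05, 0.88]
activations = [1.0, 0.5, -0.25, 2.0]

exact = sum(w * x for w, x in zip(weights, activations))
approx = quantized_dot(weights, activations)
print(f"float result: {exact:.4f}, 8-bit result: {approx:.4f}")
```

The quantized result differs from the floating point result only by a small rounding error. Any speedup on real hardware comes from running the integer multiply-accumulate step with 8-bit instructions, which this pure-Python sketch does not attempt to demonstrate.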
