Pieces of Eight: 8-bit Neural Machine Translation

Jerry Quinn; Miguel Ballesteros

arXiv:1804.05038·cs.CL·April 16, 2018

TL;DR

This paper shows that neural machine translation models trained with 32-bit floating point values can translate with 8-bit quantization, gaining a non-negligible speedup with no degradation in accuracy or adequacy. This benefits latency-sensitive applications and helps control cloud hosting costs.

Contribution

The study shows that 8-bit quantization, applied at translation time to NMT models already trained in 32-bit floating point, increases speed with no loss in translation quality.

Findings

1. Translating with 8-bit quantization increases translation speed.

2. 8-bit translation shows no degradation in accuracy or adequacy.

3. The speedup matters most for latency-sensitive applications and for controlling cloud hosting costs.

Abstract

Neural machine translation has achieved levels of fluency and adequacy that would have been surprising a short time ago. Output quality is extremely relevant for industry purposes; however, it is equally important to produce results in the shortest time possible, mainly for latency-sensitive applications and to control cloud hosting costs. In this paper we show the effectiveness of translating with 8-bit quantization for models that have been trained using 32-bit floating point values. Results show that 8-bit translation makes a non-negligible impact in terms of speed with no degradation in accuracy and adequacy.
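
The abstract does not spell out the quantization scheme, so the following is only a minimal sketch of the general idea, assuming symmetric linear quantization with a single scale per vector. The function names (`quantize`, `dequantize`, `quantized_dot`) and the sample values are hypothetical and are not taken from the paper. It illustrates how values stored as 32-bit floats can be mapped to signed 8-bit integers after training, multiplied and accumulated as integers, and rescaled to approximate the floating point result.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed 8-bit integers in [-127, 127] with one shared scale.

    Assumption: symmetric, per-vector scaling chosen from the largest magnitude.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from 8-bit integers."""
    return [q * scale for q in quantized]


def quantized_dot(a: List[float], b: List[float]) -> float:
    """Approximate a float dot product using 8-bit operands."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    # Integer multiply-accumulate. Python integers are unbounded; real
    # hardware would typically accumulate into a wider (e.g. 32-bit) integer.
    accumulator = sum(x * y for x, y in zip(qa, qb))
    # A single float multiplication rescales the integer result.
    return accumulator * scale_a * scale_b


# Hypothetical values standing in for trained weights and activations.
weights = [0.42, -1.30, 0.07, 0.88]
activations = [1.5, 0.25, -0.6, 2.0]

exact = sum(w * x for w, x in zip(weights, activations))
approx = quantized_dot(weights, activations)
print(f"float32-style result: {exact:.4f}")
print(f"8-bit result:         {approx:.4f}")
```

In a sketch like this, the inner loop works on integers only, which is where 8-bit arithmetic can save time and memory bandwidth on suitable hardware. The rounding error per value is bounded by half the scale, so the final result stays close to the floating point one.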
