Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for Low-Latency Inference in NLP Applications

Matthew Khoury; Rumen Dangovski; Longwu Ou; Preslav Nakov; Yichen Shen; Li Jing

arXiv:2010.08412 · cs.CL · October 19, 2020

Open Access

TL;DR

This paper introduces VVMA, a hardware-aware architecture that reduces inference latency for NLP models, such as those used for NMT, by a factor of four. It targets specialized hardware with low-latency vector-vector operations and incurs minimal accuracy loss.

Contribution

The paper proposes VVMA, a novel hardware-aware framework that reduces inference latency and parameter count for NLP models. It applies beyond NMT to models that rely on efficient matrix multipliers.

Findings

01. Reduces NMT inference latency by a factor of four.

02. Lowers parameter count and FLOPs with minimal impact on accuracy.

03. May extend to other domains and enable further hardware efficiency improvements.

Abstract

Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications, ranging from Neural Machine Translation (NMT) to dialogue systems. However, improving accuracy by increasing the model size requires a large number of hardware computations, which can slow down NLP applications significantly at inference time. To address this issue, we propose a novel vector-vector-matrix architecture (VVMA), which greatly reduces the latency at inference time for NMT. This architecture takes advantage of specialized hardware that has low-latency vector-vector operations and higher-latency vector-matrix operations. It also reduces the number of parameters and FLOPs for virtually all models that rely on efficient matrix multipliers without significantly impacting accuracy. We present empirical results suggesting that our framework can reduce the…
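The abstract contrasts cheap vector-vector operations with costlier vector-matrix operations. A minimal sketch of how that trade-off can be exploited follows, assuming each layer's weight matrix is expressed as one shared matrix `M` combined with a small per-layer vector `v`. This factoring, and every name below (`vvma_layer`, `dense_params`, `shared_params`), is an illustrative assumption and not the authors' implementation.

```python
# Illustrative sketch only: the factoring and all names are assumptions,
# not taken from the paper's code.

def elementwise(v, x):
    """Vector-vector (element-wise) product: the low-latency operation."""
    return [vi * xi for vi, xi in zip(v, x)]


def matvec(M, x):
    """Matrix-vector product: the higher-latency operation."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def vvma_layer(M, v, x):
    """Scale the input by the per-layer vector v, then apply the shared matrix M.

    This equals multiplying x by M with its columns scaled by v, so each
    layer needs only its own vector while M is reused across layers.
    """
    return matvec(M, elementwise(v, x))


def dense_params(n, k):
    """Parameters for k independent n-by-n weight matrices."""
    return k * n * n


def shared_params(n, k):
    """Parameters for one shared n-by-n matrix plus k length-n vectors."""
    return n * n + k * n


if __name__ == "__main__":
    M = [[1, 2], [3, 4]]   # shared matrix (hypothetical values)
    v = [2, 3]             # per-layer vector (hypothetical values)
    x = [1, 1]             # input
    print(vvma_layer(M, v, x))
    print(dense_params(512, 4), shared_params(512, 4))
```

Under this assumption, the shared matrix can stay loaded in the matrix multiplier while only the small vectors change between layers, which is one way the parameter count could shrink from k·n² to n² + k·n.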

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention · Label Smoothing