Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for Low-Latency Inference in NLP Applications
Matthew Khoury, Rumen Dangovski, Longwu Ou, Preslav Nakov, Yichen Shen, and Li Jing

TL;DR
This paper introduces VVMA, a hardware-aware architecture that significantly reduces inference latency in NLP models, such as those used for neural machine translation (NMT), by exploiting specialized hardware with low-latency vector-vector operations, with minimal accuracy loss.
Contribution
The paper proposes VVMA, a novel hardware-aware framework that reduces inference latency and parameter count for NLP models, and that is applicable beyond NMT.
Findings
Reduces NMT inference latency by a factor of four.
Lowers model parameters and FLOPs with minimal accuracy impact.
May extend to other domains and enable further hardware efficiency improvements.
Abstract
Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications, ranging from Neural Machine Translation (NMT) to dialogue systems. However, improving accuracy by increasing the model size requires a large number of hardware computations, which can slow down NLP applications significantly at inference time. To address this issue, we propose a novel vector-vector-matrix architecture (VVMA), which greatly reduces the latency at inference time for NMT. This architecture takes advantage of specialized hardware that has low-latency vector-vector operations and higher-latency vector-matrix operations. It also reduces the number of parameters and FLOPs for virtually all models that rely on efficient matrix multipliers without significantly impacting accuracy. We present empirical results suggesting that our framework can reduce the…
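The abstract's core idea, trading expensive vector-matrix work for cheap vector-vector work, can be illustrated with a small sketch. The truncated abstract above does not spell out the factorization, so the form used here is an assumption made for illustration: several layers share one matrix `M`, and each layer contributes only an element-wise vector `v`, so a layer computes `M @ (v * x)`. The function names `elementwise`, `matvec`, `shared_layer`, and `param_counts` are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: ASSUMES each layer's weight matrix is replaced by
# a shared matrix M combined with a per-layer element-wise vector v.
# This is not claimed to be the paper's exact formulation.

def elementwise(v, x):
    """Vector-vector (element-wise) product: the cheap operation."""
    return [vi * xi for vi, xi in zip(v, x)]

def matvec(M, x):
    """Dense matrix-vector product: the comparatively expensive operation."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def shared_layer(M, v, x):
    """Compute M @ (v * x): scale x element-wise, then apply the shared matrix."""
    return matvec(M, elementwise(v, x))

def param_counts(n, num_layers):
    """Parameters for num_layers square n-by-n layers: dense vs. shared-matrix."""
    dense = num_layers * n * n          # one full matrix per layer
    shared = n * n + num_layers * n     # one shared matrix plus one vector per layer
    return dense, shared

M = [[1.0, 2.0], [0.0, 1.0]]   # shared matrix (toy values)
v = [2.0, 3.0]                 # per-layer vector (toy values)
x = [1.0, 1.0]                 # input
y = shared_layer(M, v, x)      # [8.0, 3.0]

dense, shared = param_counts(n=512, num_layers=6)   # 1572864 vs. 265216
```

Under this assumption, the per-layer cost beyond the shared matrix is a single length-`n` vector, which is where the reduction in parameters comes from. Whether latency also drops depends on the hardware keeping `M` loaded across layers, which this sketch does not model.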
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention · Label Smoothing
