Decoupled Transformer for Scalable Inference in Open-domain Question Answering

Haytham ElFadeel; Stan Peshterliev

arXiv:2108.02765·cs.CL·August 6, 2021

Open Access

TL;DR

This paper introduces a decoupled transformer architecture for open-domain question answering that reduces inference cost and latency by caching offline computations, with minimal accuracy loss.

Contribution

It proposes a novel decoupling method for transformers, combined with knowledge distillation and compression layers, enabling scalable and efficient online QA systems.
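As a rough illustration of the knowledge distillation ingredient, the sketch below implements a generic temperature-scaled distillation loss (KL divergence between softened teacher and student distributions). This is a common textbook formulation, not necessarily the objective used in the paper: the function names, the temperature value, and the toy logits are all assumptions made for the example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, with temperature scaling."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KD loss: KL(teacher || student) on softened distributions.

    Assumption: this is the standard soft-label formulation, used here only
    to illustrate the idea; the paper's exact objective may differ.
    """
    p = softmax(teacher_logits, temperature)   # teacher (standard transformer)
    q = softmax(student_logits, temperature)   # student (decoupled transformer)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return float(kl.mean() * temperature ** 2)

# Toy logits over four candidate answer-start positions (hypothetical values).
teacher = np.array([[4.0, 1.0, 0.5, 0.1]])
matching_student = teacher.copy()
diverging_student = np.array([[0.1, 0.5, 1.0, 4.0]])

loss_match = distillation_loss(matching_student, teacher)
loss_diverge = distillation_loss(diverging_student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what pushes the student toward the teacher's behavior during training.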

Findings

1. Reduces computational cost and latency by 30-40%.

2. Keeps accuracy within 1.2 F1 points of a standard transformer.

3. Reduces cache storage requirements fourfold.

Abstract

Large transformer models, such as BERT, achieve state-of-the-art results in machine reading comprehension (MRC) for open-domain question answering (QA). However, transformers have a high computational cost for inference, which makes them hard to apply to online QA systems for applications like voice assistants. To reduce computational cost and latency, we propose decoupling the transformer MRC model into an input-component and a cross-component. The decoupling allows part of the representation computation to be performed offline and cached for online use. To retain the accuracy of the decoupled transformer, we devised a knowledge distillation objective from a standard transformer model. Moreover, we introduce learned representation compression layers, which reduce the storage requirement for the cache by a factor of four. In experiments on the SQuAD 2.0 dataset, a decoupled transformer reduces the…
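The decoupling and caching idea described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the single-head attention layers, the layer counts, the hidden size, the random (untrained) weights, and all function names are assumptions chosen to keep the example small. It shows the lower layers encoding question and passage independently, the passage representation being compressed fourfold for the cache, and the upper layers attending over both once the question arrives.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64                  # toy hidden size (assumption)
COMPRESSED = HIDDEN // 4     # cached vectors are 4x smaller, as in the abstract

def make_layer():
    """Random weights for one toy single-head self-attention layer."""
    return {n: rng.normal(scale=HIDDEN ** -0.5, size=(HIDDEN, HIDDEN))
            for n in ("q", "k", "v")}

def apply_layer(layer, x):
    """Self-attention with a residual connection; x has shape (tokens, HIDDEN)."""
    q, k, v = x @ layer["q"], x @ layer["k"], x @ layer["v"]
    scores = q @ k.T / np.sqrt(HIDDEN)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ v

input_layers = [make_layer() for _ in range(2)]   # see question OR passage only
cross_layers = [make_layer() for _ in range(2)]   # see question AND passage
# Hypothetical stand-ins for the learned compression layers (random, untrained).
compress = rng.normal(scale=HIDDEN ** -0.5, size=(HIDDEN, COMPRESSED))
decompress = rng.normal(scale=COMPRESSED ** -0.5, size=(COMPRESSED, HIDDEN))

def input_component(x):
    """Lower layers: encode one segment without seeing the other."""
    for layer in input_layers:
        x = apply_layer(layer, x)
    return x

def cache_passage(passage_emb):
    """Offline step: encode the passage alone and store it compressed."""
    return input_component(passage_emb) @ compress

def answer_online(question_emb, cached_passage):
    """Online step: encode the question, restore the passage, run the cross layers."""
    q = input_component(question_emb)
    p = cached_passage @ decompress
    x = np.concatenate([q, p], axis=0)
    for layer in cross_layers:
        x = apply_layer(layer, x)
    return x

passage = rng.normal(size=(50, HIDDEN))    # 50 toy passage token embeddings
question = rng.normal(size=(8, HIDDEN))    # 8 toy question token embeddings
cached = cache_passage(passage)            # computed once, offline
output = answer_online(question, cached)   # computed per query, online
```

In this sketch the online path skips the input layers for the 50 passage tokens entirely, which is where the inference savings would come from; the cache holds `50 x 16` values instead of `50 x 64`.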

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Knowledge Distillation · Multi-Head Attention · Layer Normalization · WordPiece · Softmax · Dropout · Dense Connections