Decoupled Transformer for Scalable Inference in Open-domain Question Answering
Haytham ElFadeel, Stan Peshterliev

TL;DR
This paper introduces a decoupled transformer architecture for open-domain question answering that reduces inference cost and latency by caching offline computations, with minimal accuracy loss.
Contribution
It proposes a novel decoupling method for transformers, combined with knowledge distillation and compression layers, enabling scalable and efficient online QA systems.
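The distillation idea can be illustrated with a generic soft-target loss. This summary does not state the paper's exact objective, so the sketch below is an assumption: it uses the common formulation in which a student (here, the decoupled model) is trained to match a teacher's (the standard transformer's) softened output distribution, for example over answer start positions. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    averaged over the batch. A generic formulation, not necessarily
    the objective used in the paper."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(kl.mean())

# Hypothetical logits over three candidate answer positions.
teacher = np.array([[2.0, 0.5, -1.0]])
student = np.zeros((1, 3))
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what pushes the decoupled model toward the standard model's behaviour.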
Findings
Reduces computational cost and latency by 30-40%.
Maintains high accuracy, with an F1-score only 1.2 points lower.
Achieves a fourfold reduction in cache storage requirements.
Abstract
Large transformer models, such as BERT, achieve state-of-the-art results in machine reading comprehension (MRC) for open-domain question answering (QA). However, transformers have a high computational cost for inference, which makes them hard to apply to online QA systems for applications like voice assistants. To reduce computational cost and latency, we propose decoupling the transformer MRC model into an input-component and a cross-component. The decoupling allows part of the representation computation to be performed offline and cached for online use. To retain the accuracy of the decoupled transformer, we devise a knowledge distillation objective from a standard transformer model. Moreover, we introduce learned representation compression layers, which reduce the storage requirement for the cache by a factor of four. In experiments on the SQuAD 2.0 dataset, a decoupled transformer reduces the…
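The offline/online split described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the layer counts, the hidden size `D`, the single-head attention block, and the use of plain linear projections as compression layers are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_SMALL = 64, 16  # hypothetical hidden size and 4x-smaller cached size

def self_attention_layer(x, w):
    """Toy single-head self-attention block with a residual connection."""
    q, k, v = x @ w["q"], x @ w["k"], x @ w["v"]
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return x + attn @ v

def make_layer():
    return {name: rng.normal(scale=0.1, size=(D, D)) for name in ("q", "k", "v")}

input_layers = [make_layer() for _ in range(2)]  # question and passage encoded separately
cross_layers = [make_layer() for _ in range(2)]  # question and passage attend jointly
compress = rng.normal(scale=0.1, size=(D, D_SMALL))    # assumed linear compression
decompress = rng.normal(scale=0.1, size=(D_SMALL, D))  # assumed linear decompression

def input_component(x):
    for w in input_layers:
        x = self_attention_layer(x, w)
    return x

def cross_component(question_repr, passage_repr):
    x = np.concatenate([question_repr, passage_repr], axis=0)
    for w in cross_layers:
        x = self_attention_layer(x, w)
    return x

cache = {}

def index_passage(passage_id, passage_embeddings):
    """Offline: encode the passage once and store the compressed result."""
    cache[passage_id] = input_component(passage_embeddings) @ compress

def answer_online(question_embeddings, passage_id):
    """Online: encode only the question, then reuse the cached passage."""
    question_repr = input_component(question_embeddings)
    passage_repr = cache[passage_id] @ decompress
    return cross_component(question_repr, passage_repr)

passage = rng.normal(size=(50, D))   # 50 passage tokens
question = rng.normal(size=(8, D))   # 8 question tokens
index_passage("p1", passage)
output = answer_online(question, "p1")
```

At query time the passage never passes through the input-component, which is where the saving would come from, and each cached token vector holds `D_SMALL` values instead of `D`, mirroring the fourfold storage reduction.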
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Knowledge Distillation · Multi-Head Attention · Layer Normalization · WordPiece · Softmax · Dropout · Dense Connections
