DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling

Jiecao Chen; Liu Yang; Karthik Raman; Michael Bendersky; Jung-Jung Yeh; Yun Zhou; Marc Najork; Danyang Cai; Ehsan Emadzadeh

arXiv:2010.03099 · cs.CL · May 6, 2021

TL;DR

DiPair is a distillation framework for text pair modeling. It reports a speedup of over 350x relative to BERT with minimal quality loss, which makes it practical for large-scale NLP and IR applications.

Contribution

The paper introduces DiPair, a scalable distillation framework optimized for tasks over pairs (or tuples) of texts. Combined with an end-to-end training strategy, it offers a better quality-speed tradeoff than existing distillation approaches.

Findings

1. Achieves over 350x speedup compared to BERT
2. Incurs only minimal quality loss on text pair tasks
3. Effective on both academic and real-world e-commerce benchmarks

Abstract

Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. However -- as we show here -- existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, they are either not scalable or demonstrate subpar performance. In this work, we propose DiPair -- a novel framework for distilling fast and accurate models on text pair tasks. Coupled with an end-to-end training strategy, DiPair is both highly scalable and offers improved quality-speed tradeoffs. Empirical studies conducted on both academic and real-world e-commerce benchmarks…
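
For readers unfamiliar with the knowledge distillation remedy the abstract refers to, the following is a minimal sketch of a generic soft-target distillation loss in the style of Hinton et al. (2015). It is not DiPair's training objective, which the abstract does not describe. The function names, the temperature value, and the example logits are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across
    temperatures, following the usual soft-target formulation.
    This is a generic sketch, not the loss used in the DiPair paper.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )
    return temperature ** 2 * kl


# Hypothetical logits for a binary text pair label (match / no match).
teacher = [0.5, 3.0]
agreeing_student = [0.4, 2.9]
disagreeing_student = [3.0, 0.5]

loss_close = distillation_loss(agreeing_student, teacher)
loss_far = distillation_loss(disagreeing_student, teacher)
```

In this sketch a student whose logits track the teacher's gets a loss near zero, while a student that ranks the two labels the other way round gets a larger loss. In practice this term is usually mixed with a supervised loss on ground-truth labels.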

Taxonomy

Methods: Linear Layer · Knowledge Distillation · Dense Connections · Layer Normalization · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay