DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling
Jiecao Chen, Liu Yang, Karthik Raman, Michael Bendersky, Jung-Jung Yeh, Yun Zhou, Marc Najork, Danyang Cai, Ehsan Emadzadeh

TL;DR
DiPair is a distillation framework that speeds up text pair modeling by over 350x relative to BERT with minimal quality loss, making it practical for large-scale NLP applications.
Contribution
The paper introduces DiPair, a scalable distillation framework designed for tasks over pairs (or tuples) of texts. Coupled with an end-to-end training strategy, it offers better quality-speed tradeoffs than existing distillation approaches.
Findings
Achieves over 350x speedup compared to BERT
Incurs only minimal quality loss on text pair tasks
Effective on both academic and real-world e-commerce benchmarks
Abstract
Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. However -- as we show here -- existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, they are either not scalable or demonstrate subpar performance. In this work, we propose DiPair -- a novel framework for distilling fast and accurate models on text pair tasks. Coupled with an end-to-end training strategy, DiPair is both highly scalable and offers improved quality-speed tradeoffs. Empirical studies conducted on both academic and real-world e-commerce benchmarks…
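The abstract names knowledge distillation (Hinton et al., 2015) as the common remedy for slow inference. As a rough illustration of that general idea only, and not of DiPair's own architecture or training objective (which this page does not describe), the sketch below computes a temperature-softened distillation loss between teacher and student logits. The function names, the use of KL divergence, and the default temperature are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's.

    Hypothetical helper: a generic soft-target loss, scaled by T^2 as is
    commonly done so gradient magnitudes stay comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(p_teacher, p_student)
        if p > 0.0
    )
    return temperature ** 2 * kl


# Illustrative logits for a binary "match / no match" text pair decision.
teacher = [2.5, -1.0]
student = [1.0, 0.2]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two softened distributions diverge. In practice it is usually combined with a standard supervised loss on labeled data.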
Taxonomy
Methods: Linear Layer · Knowledge Distillation · Dense Connections · Layer Normalization · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay
