BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online E-Commerce Search

Yunjiang Jiang; Yue Shang; Ziyang Liu; Hongwei Shen; Yun Xiao; Wei Xiong; Sulong Xu; Weipeng Yan; Di Jin

arXiv:2010.10442·cs.LG·October 21, 2020

TL;DR

This paper introduces BERT2DNN, a distillation framework that transfers knowledge from large Transformer teacher models into simple feed-forward networks for e-commerce search relevance, recovering more than 97% of teacher test accuracy at a latency about 150x lower than BERT-Base.

Contribution

The work presents a novel distillation method that leverages unlabeled data and model stacking to produce lightweight models with near-Transformer accuracy for search relevance.

Findings

01 Student model recovers over 97% of teacher accuracy.

02 Latency reduced by up to 150x compared to BERT-Base.

03 Method improves accuracy without increasing model complexity.

Abstract

Relevance has a significant impact on user experience and business profit for e-commerce search platforms. In this work, we propose a data-driven framework for search relevance prediction that distills knowledge from BERT and related multi-layer Transformer teacher models into simple feed-forward networks, using a large amount of unlabeled data. The distillation process produces a student model that recovers more than 97% of the teacher models' test accuracy on new queries, at a serving cost that is several orders of magnitude lower (latency 150x lower than BERT-Base and 15x lower than the most efficient BERT variant, TinyBERT). Applying temperature rescaling and teacher model stacking further boosts model accuracy without increasing student model complexity. We present experimental results on both in-house e-commerce search relevance data and a public data set on sentiment analysis…
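
To make the distillation idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary relevance task, combines teachers by averaging their logits (a simple stand-in for the stacking the abstract mentions, which may differ in the paper), softens the result with a temperature, and scores a student prediction with cross-entropy against that soft label. The function names, the logits, and the temperature value are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_label(teacher_logits, temperature=1.0):
    """Build a soft target from one or more teacher logits.

    Assumption: teachers are combined by averaging their logits, then the
    mean is divided by a temperature before the sigmoid. A temperature
    above 1 pulls the target toward 0.5.
    """
    mean_logit = sum(teacher_logits) / len(teacher_logits)
    return sigmoid(mean_logit / temperature)


def distillation_loss(student_logit, target, eps=1e-12):
    """Binary cross-entropy of the student prediction against a soft target."""
    p = sigmoid(student_logit)
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Hypothetical teacher logits for one unlabeled query-item pair.
teacher_logits = [2.0, 4.0]
target = soft_label(teacher_logits, temperature=3.0)

# A student logit that matches the softened teacher incurs a lower loss
# than one that disagrees with it.
loss_match = distillation_loss(1.0, target)
loss_off = distillation_loss(-1.0, target)
print(loss_match < loss_off)
```

Because the targets come from teacher scores rather than human labels, the same loss can be computed on any unlabeled query-item pair, which is what lets the student train on a large unlabeled corpus.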

Taxonomy

Topics: Web Data Mining and Analysis · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dense Connections · Multi-Head Attention · Label Smoothing · Dropout · Linear Warmup With Linear Decay