SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance
Li Lyna Zhang, Youkow Homma, Yujing Wang, Min Wu, Mao Yang, Ruofei Zhang, Ting Cao, Wei Shen

TL;DR
This paper introduces SwiftPruner, a framework that combines evolution-based search with reinforcement learning to choose layer-wise sparsity for structured pruning of BERT, enabling real-time ad relevance inference for cold start ads on a CPU under a latency constraint.
Contribution
It proposes a novel reinforcement learning approach with a latency-aware reward to optimize layer-wise sparsity in BERT for low-latency ad relevance tasks.
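As a rough illustration of what a latency-aware reward can look like, the sketch below scores a candidate model by its accuracy and subtracts a penalty when measured latency exceeds the budget. The function name, the `penalty` weight, and the linear overshoot term are all assumptions made for illustration; the summary does not give the paper's actual reward formula.

```python
def latency_aware_reward(accuracy, latency_ms, budget_ms, penalty=1.0):
    """Hypothetical reward: accuracy minus a penalty for exceeding the latency budget."""
    # Relative overshoot is zero whenever the candidate meets the budget.
    overshoot = max(0.0, latency_ms - budget_ms) / budget_ms
    return accuracy - penalty * overshoot
```

Under this form, two candidates that both meet the budget are ranked purely by accuracy, while a candidate that is 50% over budget loses 0.5 from its score at the default penalty.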
Findings
Achieves higher ROC AUC than uniform sparsity baselines.
Reduces cold start ad defect ratio by 11.7%.
Meets latency constraints with improved model performance.
Abstract
Ad relevance modeling plays a critical role in online advertising systems including Microsoft Bing. To leverage powerful transformers like BERT in this low-latency setting, many existing approaches perform ad-side computations offline. While efficient, these approaches are unable to serve cold start ads, resulting in poor relevance predictions for such ads. This work aims to design a new, low-latency BERT via structured pruning to empower real-time online inference for cold start ads relevance on a CPU platform. Our challenge is that previous methods typically prune all layers of the transformer to a high, uniform sparsity, thereby producing models which cannot achieve satisfactory inference speed with an acceptable accuracy. In this paper, we propose SwiftPruner - an efficient framework that leverages evolution-based search to automatically find the best-performing layer-wise sparse…
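The idea of searching for layer-wise sparsity under a latency constraint, rather than pruning every layer uniformly, can be sketched with a toy evolutionary loop. Everything below is an assumption for illustration: `latency` and `accuracy_proxy` are made-up stand-ins for real measurements, the 12-layer depth is a guess at a BERT-base-like model, and mutation is plain random perturbation. The reinforcement-learning component named in the title is not modelled here.

```python
import random

NUM_LAYERS = 12  # assumption: BERT-base-like depth


def latency(sparsities):
    # Hypothetical proxy: a layer's cost is the fraction of weights it keeps.
    return sum(1.0 - s for s in sparsities)


def accuracy_proxy(sparsities):
    # Hypothetical proxy: later layers are assumed more sensitive to pruning.
    total_weight = sum(range(1, NUM_LAYERS + 1))
    loss = sum((i + 1) * s * s for i, s in enumerate(sparsities))
    return 1.0 - loss / total_weight


def fitness(sparsities, budget):
    # Candidates over the latency budget are rejected outright.
    if latency(sparsities) > budget + 1e-9:
        return float("-inf")
    return accuracy_proxy(sparsities)


def mutate(sparsities, rng, step=0.1):
    # Randomly perturb the sparsity of one layer, clamped to [0, 0.95].
    child = list(sparsities)
    i = rng.randrange(len(child))
    child[i] = min(0.95, max(0.0, child[i] + rng.uniform(-step, step)))
    return child


def uniform_baseline(budget):
    # Same sparsity in every layer, chosen to sit exactly on the budget.
    return [1.0 - budget / NUM_LAYERS] * NUM_LAYERS


def evolve(budget, generations=200, population_size=20, seed=0):
    rng = random.Random(seed)
    start = uniform_baseline(budget)
    population = [start] + [mutate(start, rng) for _ in range(population_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, budget), reverse=True)
        parents = population[: population_size // 4]  # keep the fittest quarter
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=lambda s: fitness(s, budget))
```

Because the uniform baseline seeds the population and the fittest candidates always survive, `evolve(3.0)` returns a layer-wise configuration that meets the same budget and scores at least as well as `uniform_baseline(3.0)` under these toy proxies.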
Taxonomy
Topics: Recommender Systems and Techniques · Advanced Image and Video Retrieval Techniques · Digital Marketing and Social Media
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · WordPiece · Layer Normalization · Softmax · Linear Warmup With Linear Decay · Adam
