NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application

Chuhan Wu; Fangzhao Wu; Yang Yu; Tao Qi; Yongfeng Huang; Qi Liu

arXiv:2102.04887 · cs.CL · September 3, 2021

TL;DR

NewsBERT introduces a novel knowledge distillation framework tailored for news applications, enabling efficient, smaller models that retain high performance for tasks like recommendation and retrieval.

Contribution

The paper presents a joint learning and momentum distillation approach designed specifically for news-domain models, aiming to make them more efficient and more effective than general-purpose pre-trained models.

Findings

01. NewsBERT achieves performance comparable to larger pre-trained models while using significantly smaller models.

02. The momentum distillation method enhances knowledge transfer from the teacher model to the student model.

03. Experiments demonstrate improved accuracy on news recommendation and retrieval tasks.

Abstract

Pre-trained language models (PLMs) like BERT have made great progress in NLP. News articles usually contain rich textual information, and PLMs have the potential to enhance news text modeling for various intelligent news applications such as news recommendation and retrieval. However, most existing PLMs are very large, with hundreds of millions of parameters. Many online news applications need to serve millions of users with low latency tolerance, which makes it very challenging to incorporate PLMs in these scenarios. Knowledge distillation techniques can compress a large PLM into a much smaller one while maintaining good performance. However, existing language models are pre-trained and distilled on general corpora like Wikipedia, which differ from the news domain and may be suboptimal for news intelligence. In this paper, we propose NewsBERT, which can distill PLMs for efficient…
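
The generic idea of knowledge distillation mentioned in the abstract can be illustrated with a small sketch. This is not NewsBERT's joint learning or momentum distillation method, whose details are not given in this summary. It shows only the common soft-label formulation, in which a student is trained to match a teacher's temperature-softened output distribution. The function names, the temperature value, and the example logits are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is a common convention that keeps gradient magnitudes
    comparable across temperatures; it is an assumption here, not a detail
    taken from the paper.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for a 3-class toy problem.
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]   # roughly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]     # disagrees with the teacher

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

In this toy setting the student whose logits roughly track the teacher's receives a smaller loss than the one that ranks the classes in the opposite order, which is the signal a distillation objective of this kind gives the student during training.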

Taxonomy

Methods: Linear Layer · Knowledge Distillation · Softmax · Dropout · Residual Connection · Layer Normalization · Attention Dropout · WordPiece · Multi-Head Attention · Adam