NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application
Chuhan Wu, Fangzhao Wu, Yang Yu, Tao Qi, Yongfeng Huang, Qi Liu

TL;DR
NewsBERT introduces a novel knowledge distillation framework tailored for news applications, enabling efficient, smaller models that retain high performance for tasks like recommendation and retrieval.
Contribution
The paper presents a joint learning and momentum distillation approach specifically designed for news domain models, improving efficiency and effectiveness over general pre-trained models.
Findings
NewsBERT's distilled student models are significantly smaller than general pre-trained models while achieving comparable performance.
The momentum distillation method enhances knowledge transfer from teacher to student.
Experiments demonstrate improved accuracy in news recommendation and retrieval tasks.
Abstract
Pre-trained language models (PLMs) like BERT have made great progress in NLP. News articles usually contain rich textual information, and PLMs have the potential to enhance news text modeling for various intelligent news applications like news recommendation and retrieval. However, most existing PLMs are huge, with hundreds of millions of parameters. Many online news applications need to serve millions of users with low latency tolerance, which poses huge challenges to incorporating PLMs in these scenarios. Knowledge distillation techniques can compress a large PLM into a much smaller one while maintaining good performance. However, existing language models are pre-trained and distilled on general corpora like Wikipedia, which have some gaps with the news domain and may be suboptimal for news intelligence. In this paper, we propose NewsBERT, which can distill PLMs for efficient…
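As a rough illustration of the generic knowledge distillation idea mentioned in the abstract, the sketch below combines a temperature-softened teacher-student term with the usual cross-entropy on the true label. It is not NewsBERT's joint learning or momentum distillation method, which this summary does not spell out. The function names and the `temperature` and `alpha` parameters are assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic soft-label distillation loss (illustrative; not the paper's loss).

    alpha weights the teacher-matching term against the hard-label term.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Scaling by T^2 keeps the soft term's gradient magnitude comparable
    # across temperatures (a common convention, assumed here).
    soft_term = kl * temperature ** 2
    # Standard cross-entropy of the student against the ground-truth label.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term

if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    student = [0.5, 1.5, 3.5]
    print(distillation_loss(student, teacher, label=2))
```

When the student's logits equal the teacher's, the soft term vanishes and only the weighted hard-label term remains, which is a quick sanity check for an implementation like this.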
Taxonomy
Methods: Linear Layer · Knowledge Distillation · Softmax · Dropout · Residual Connection · Layer Normalization · Attention Dropout · WordPiece · Multi-Head Attention · Adam
