Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective

Xueying Bai; Yifan Sun; Niranjan Balasubramanian

arXiv:2410.05648·cs.LG·October 10, 2024

TL;DR

This paper investigates how attention sink tokens in pre-trained models like RoBERTa and BERT affect continual learning performance, proposing a pre-scaling method to improve attention diversity and CL outcomes.

Contribution

It introduces a pre-scaling mechanism that reduces attention sink effects and thereby improves the continual learning performance of pre-trained models.

Findings

01. Pre-scaling improves CL performance without experience replay.

02. Attention sinks such as [SEP] tokens can cause interference between sequentially learned tasks.

03. Pre-trained models with attention sinks may underperform in CL despite their high pre-trained capabilities.
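
To make the notion of an attention sink concrete, the sketch below measures how much of each query's attention lands on sink tokens in a single attention map. It is an illustration only, not the paper's metric: the function name `sink_attention_share`, the toy token sequence, and the attention values are all assumptions made up for this example.

```python
def sink_attention_share(attn, tokens, sink_tokens=("[SEP]",)):
    """Average fraction of each query's attention that lands on sink tokens.

    attn:   list of rows; attn[i][j] is the attention weight from query i
            to key j. Each row is assumed to sum to 1 (post-softmax).
    tokens: list of token strings, one per key position.
    """
    # Key positions whose token is treated as a sink (assumption: matched by string).
    sink_positions = [j for j, tok in enumerate(tokens) if tok in sink_tokens]
    # Attention mass each query sends to those positions.
    per_query = [sum(row[j] for j in sink_positions) for row in attn]
    return sum(per_query) / len(per_query)


# Hypothetical 4-token input and a hand-written attention map (not model output).
tokens = ["[CLS]", "good", "movie", "[SEP]"]
attn = [
    [0.10, 0.10, 0.10, 0.70],
    [0.05, 0.15, 0.10, 0.70],
    [0.05, 0.10, 0.05, 0.80],
    [0.10, 0.10, 0.10, 0.70],
]

sep_share = sink_attention_share(attn, tokens)
cls_share = sink_attention_share(attn, tokens, sink_tokens=("[CLS]",))
print(f"share on [SEP]: {sep_share:.3f}, share on [CLS]: {cls_share:.3f}")
```

In this toy map most of the attention mass goes to [SEP], which is the pattern the findings describe. With a real model, the same function could be applied per layer and per head to attention maps exported from the model.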

Abstract

Continual learning (CL) aims to train models that can sequentially learn new tasks without forgetting previous tasks' knowledge. Although previous works observed that pre-training can benefit CL, it remains unclear whether a pre-trained model with higher downstream capacity also performs better in CL. In this paper, we observe that pre-trained models may allocate high attention scores to some 'sink' tokens, such as [SEP] tokens, which are ubiquitous across various tasks. Such attention sinks may lead to models' over-smoothing in single-task learning and interference in sequential tasks' learning, which may compromise the models' CL performance despite their high pre-trained capabilities. To reduce these effects, we propose a pre-scaling mechanism that encourages attention diversity across all tokens. Specifically, it first scales the task's attention to the non-sink tokens in a probing…
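
The link the abstract draws between attention sinks and over-smoothing can be illustrated with a toy single-head attention step. The sketch assumes the common reading of over-smoothing as token representations becoming nearly identical; the paper's exact definition may differ. The value vectors, both attention maps, and all function names are hypothetical, and the code does not implement the pre-scaling mechanism itself.

```python
import math


def attend(attn, values):
    """Single-head attention output: out[i] = sum_j attn[i][j] * values[j]."""
    dim = len(values[0])
    return [
        [sum(row[j] * values[j][d] for j in range(len(values))) for d in range(dim)]
        for row in attn
    ]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


# Toy value vectors for three tokens; the last position plays the role of the sink.
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical attention maps: one spread over content tokens, one dominated by the sink.
diverse_attn = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
sink_attn = [[0.10, 0.00, 0.90], [0.00, 0.10, 0.90], [0.05, 0.05, 0.90]]

diverse_sim = mean_pairwise_cosine(attend(diverse_attn, values))
sink_sim = mean_pairwise_cosine(attend(sink_attn, values))
print(f"diverse attention: {diverse_sim:.3f}, sink-heavy attention: {sink_sim:.3f}")
```

When every query places most of its weight on the same sink position, the outputs are all dominated by the sink's value vector, so their pairwise similarity is much higher than under the diverse map. Encouraging attention diversity, as the abstract proposes, targets this collapse.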

Code & Models

Repositories

StonyBrookNLP/attention-sink-cl (PyTorch, official)

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Softmax · Attention Is All You Need · Attention Sinks