Towards Understanding Self-Pretraining for Sequence Classification
Omar Coser, Loredana Zollo, Paolo Soda, Antonio Orvieto

TL;DR
This paper investigates how self-pretraining improves Transformer sequence classification, finding that it helps the model learn useful attention patterns and proximity biases that standard supervised training often misses.
Contribution
The paper systematically replicates and ablates the findings of Amos et al. (2024), identifying mechanisms behind self-pretraining such as proximity interactions and attention pattern learning.
Findings
Self-pretraining enhances attention pattern learning in Transformers.
Proximity-biased attention scores are a key source of improvement.
Supervised training can be blind to certain attention directions.
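To make the idea of proximity-biased attention scores concrete, here is a minimal sketch. This summary does not say how the paper parameterizes or measures proximity bias, so the linear distance penalty, the `slope` parameter, and the function names below are all assumptions chosen for illustration, not the authors' method.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def proximity_biased_attention(scores, slope=1.0):
    """Subtract slope * |i - j| from each raw score, then softmax each row.

    `scores` is a square list of lists of query-key scores.
    The linear penalty is an illustrative assumption, not the paper's form.
    """
    n = len(scores)
    return [
        softmax([scores[i][j] - slope * abs(i - j) for j in range(n)])
        for i in range(n)
    ]


# With uninformative (all-zero) scores, the bias alone shapes the weights:
# each position attends most to itself and less to distant positions.
uniform_scores = [[0.0] * 5 for _ in range(5)]
weights = proximity_biased_attention(uniform_scores, slope=1.0)
```

Setting `slope=0.0` recovers plain softmax attention, which makes it easy to compare the biased and unbiased weights on the same scores.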
Abstract
Amos et al. (2024) showed that the accuracy of Transformer models in sequence classification can be significantly improved by first pretraining with a masked token prediction objective without external data or augmentation, a procedure referred to as self-pretraining (SPT). While the primary objective of Amos et al. (2024) was to showcase that Transformers can achieve strong performance on the Long-Range Arena (LRA), their pipeline raises more fundamental questions: How does SPT drive optimization to better solutions? Why can standard supervised training fail in Transformers? To better understand this, we replicate and systematically ablate the findings of Amos et al. (2024). Our ablations suggest that a central bottleneck in the studied settings is not depth or generalization alone, but the ability of label supervision to learn useful query-key Attention patterns from random…
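The masked token prediction step that the abstract describes can be sketched as follows: tokens of a training sequence are hidden at random and become reconstruction targets, using only the task's own data. This is a rough illustration under stated assumptions. The masking rate, the `IGNORE` sentinel, the `mask_id` value, and the function name are hypothetical, and the summary does not specify the settings used in the paper or in Amos et al. (2024).

```python
import random

IGNORE = -100  # hypothetical sentinel: positions that carry no loss


def mask_tokens(tokens, mask_id, mask_prob=0.15, seed=0):
    """Return (corrupted, targets) for one masked token prediction example.

    mask_prob=0.15 is an assumed default, not a value taken from the paper.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_id)  # hide the token from the model
            targets.append(tok)        # the model must reconstruct it
        else:
            corrupted.append(tok)      # token stays visible
            targets.append(IGNORE)     # unmasked positions are not scored
    return corrupted, targets


# The inputs come from the classification dataset itself; labels are unused.
sequence = [5, 9, 2, 7, 7, 3, 1, 8]
corrupted, targets = mask_tokens(sequence, mask_id=0, mask_prob=0.5, seed=1)
```

After pretraining on pairs like `(corrupted, targets)`, the same model would be trained on the original sequences with their class labels, which is the supervised stage the paper compares against training from random initialization.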
