Token Masking Improves Transformer-Based Text Classification
Xianglong Xu, John Bowen, Rojin Taheri

TL;DR
This paper introduces token masking regularization for transformer models, which randomly masks input tokens during training to improve text classification performance by reducing overfitting and smoothing gradients.
Contribution
It proposes a simple, theoretically motivated token masking method that enhances transformer-based text classifiers across multiple models and tasks.
Findings
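Consistent improvements over standard regularization techniques on language identification and sentiment analysis, across mBERT, Qwen2.5-0.5B, and TinyLlama-1.1B.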
Optimal masking rates are task-specific, with p = 0.1 as a strong general default.
Gains attributed to reduced overfitting and implicit ensembling.
Abstract
While transformer-based models achieve strong performance on text classification, we explore whether masking input tokens can further enhance their effectiveness. We propose token masking regularization, a simple yet theoretically motivated method that randomly replaces input tokens with a special [MASK] token at probability p. This introduces stochastic perturbations during training, leading to implicit gradient averaging that encourages the model to capture deeper inter-token dependencies. Experiments on language identification and sentiment analysis -- across diverse models (mBERT, Qwen2.5-0.5B, TinyLlama-1.1B) -- show consistent improvements over standard regularization techniques. We identify task-specific optimal masking rates, with p = 0.1 as a strong general default. We attribute the gains to two key effects: (1) input perturbation reduces overfitting, and (2) gradient-level…
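The masking step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `mask_tokens`, the `special_ids` exclusion set, and the example token IDs are all assumptions introduced here, and the abstract does not say whether special tokens are exempt from masking.

```python
import random

def mask_tokens(token_ids, mask_id, p=0.1, special_ids=frozenset(), rng=None):
    """Replace each token with mask_id independently with probability p.

    Assumption (not stated in the abstract): tokens listed in special_ids,
    such as classification or padding tokens, are never masked.
    """
    rng = rng or random.Random()
    return [
        mask_id if (tok not in special_ids and rng.random() < p) else tok
        for tok in token_ids
    ]

# Hypothetical IDs for illustration only; real values depend on the tokenizer.
CLS_ID, SEP_ID, MASK_ID = 1, 2, 4
example = [CLS_ID, 17, 52, 98, 23, SEP_ID]

# Apply only during training; evaluation inputs are left unmasked.
masked = mask_tokens(example, MASK_ID, p=0.1,
                     special_ids={CLS_ID, SEP_ID}, rng=random.Random(0))
```

Because a fresh mask is drawn each time an example is seen, the model trains on a different perturbed view of the same input in every epoch, which is the stochastic perturbation the abstract refers to.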
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Authorship Attribution and Profiling · Text and Document Classification Technologies
