Mitigating Transformer Overconfidence via Lipschitz Regularization
Wenqian Ye, Yunsheng Ma, Xu Cao, Kun Tang

TL;DR
This paper introduces LRFormer, a Lipschitz regularized Transformer that reduces overconfidence in predictions by ensuring Lipschitz continuity, leading to improved calibration and uncertainty estimation in vision tasks.
Contribution
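The paper proposes a Lipschitz regularization method for Transformers: a new self-attention similarity function based on distance within a Banach space, regularized by a contractive Lipschitz bound. The method comes with theoretical guarantees and outperforms single-forward-pass baselines empirically.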
Findings
Outperforms state-of-the-art single-forward-pass approaches in prediction on standard vision benchmarks
Improves calibration and uncertainty estimation over the same baselines
Provides theoretical guarantees for Lipschitz regularization
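Calibration, as listed in the findings above, is commonly quantified with the Expected Calibration Error (ECE). The summary does not say which metric the paper reports, so the following is only a minimal sketch of the standard binned ECE, assuming NumPy; the function name and the choice of 10 bins are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen class, in (0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    n_bins is an illustrative default, not a value taken from the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between how often the model is right and how sure it claims to be.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            # Weight the gap by the fraction of samples falling in this bin.
            ece += in_bin.mean() * gap
    return ece
```

An overconfident model shows up as a large gap: four predictions made at 0.95 confidence with only three correct give an accuracy of 0.75 and hence a nonzero ECE, whereas a model that is right every time it claims full confidence scores zero.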
Abstract
Though Transformers have achieved promising results in many computer vision tasks, they tend to be over-confident in predictions, as the standard Dot Product Self-Attention (DPSA) can barely preserve distance for the unbounded input domain. In this work, we fill this gap by proposing a novel Lipschitz Regularized Transformer (LRFormer). Specifically, we present a new similarity function with the distance within Banach Space to ensure the Lipschitzness and also regularize the term by a contractive Lipschitz Bound. The proposed method is analyzed with a theoretical guarantee, providing a rigorous basis for its effectiveness and reliability. Extensive experiments conducted on standard vision benchmarks demonstrate that our method outperforms the state-of-the-art single forward pass approaches in prediction, calibration, and uncertainty estimation.
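The contrast the abstract draws between dot-product similarity and a distance-based similarity can be illustrated with a small sketch. This is not the LRFormer similarity function: the abstract does not give its formula, so the code below swaps in a plain negative squared Euclidean distance and omits the contractive Lipschitz bound entirely. All function names are hypothetical, and NumPy is assumed.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def dot_product_scores(q, k):
    # Standard scaled dot-product similarity (DPSA).
    return q @ k.T / np.sqrt(q.shape[-1])

def distance_scores(q, k):
    # Illustrative stand-in: negative squared L2 distance between each
    # query and key. Assumption only, NOT the paper's similarity function.
    sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
    return -sq_dist / np.sqrt(q.shape[-1])

def attention(q, k, v, score_fn):
    # Turn similarity scores into weights, then mix the value vectors.
    return softmax(score_fn(q, k)) @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8 features; queries = keys = values
out_dot = attention(x, x, x, dot_product_scores)
out_dist = attention(x, x, x, distance_scores)
```

With the distance-based scores, every entry is non-positive and a token's score against an identical key is exactly zero, so the score depends only on how far apart the query and key are. Dot-product scores have no such tie to distance, which is the property the abstract identifies as a source of overconfidence.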
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Label Smoothing · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Residual Connection
