Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation
Jiayi Li, Shijie Tang, Gün Kaynar, Shiyi Du, Carl Kingsford

TL;DR
This paper introduces Shortcut Guardrail, a deployment-time method that mitigates token-level shortcuts in pretrained language models without needing original training data or shortcut annotations.
Contribution
It combines gradient-based attribution, used to flag likely shortcut tokens, with a lightweight LoRA-based debiasing module trained under a Masked Contrastive Learning (MaskCL) objective, so that shortcuts can be mitigated at deployment.
Findings
Improves overall accuracy on tasks with shortcuts.
Enhances worst-group accuracy under distribution shifts.
Preserves in-distribution performance.
Abstract
Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language…
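To make the two ideas in the abstract concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It assumes a toy linear classifier over mean-pooled token embeddings, for which gradient-times-input attribution has a closed form, and it replaces the MaskCL objective with a plain consistency term (1 minus cosine similarity between representations with and without one token), without the negatives a contrastive loss would use and without LoRA. The embeddings, weights, and function names are all hypothetical.

```python
import math

# Hypothetical 2-d token embeddings; "spielberg" stands in for a shortcut token.
EMBEDDINGS = {
    "the": [0.1, 0.0],
    "plot": [0.2, 0.1],
    "dragged": [-0.6, 0.2],
    "spielberg": [2.0, 0.5],
}
WEIGHTS = [1.5, -0.5]  # hypothetical weights of a biased linear classifier


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def token_attributions(tokens):
    """Gradient-times-input saliency for logit = WEIGHTS . mean(embeddings)."""
    n = len(tokens)
    # d(logit)/d(e_i) = WEIGHTS / n, so gradient . input = (WEIGHTS . e_i) / n
    return [abs(dot(WEIGHTS, EMBEDDINGS[t])) / n for t in tokens]


def masked_consistency_loss(tokens, masked_index):
    """1 - cosine similarity between pooled representations with and without one token."""
    full = mean_pool([EMBEDDINGS[t] for t in tokens])
    kept = [EMBEDDINGS[t] for i, t in enumerate(tokens) if i != masked_index]
    masked = mean_pool(kept)
    norm = math.sqrt(dot(full, full)) * math.sqrt(dot(masked, masked))
    return 1.0 - dot(full, masked) / norm


tokens = ["the", "plot", "dragged", "spielberg"]
scores = token_attributions(tokens)
# Index of the token with the largest attribution: the suspected shortcut.
suspect = max(range(len(tokens)), key=scores.__getitem__)
loss = masked_consistency_loss(tokens, suspect)
```

In this toy setup the highest-attribution token is the planted shortcut, and removing it shifts the pooled representation far more than removing a neutral token such as "the". A debiasing module trained to shrink that gap would, in this simplified picture, reduce the model's dependence on the flagged token.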
