Assessing Robustness to Spurious Correlations in Post-Training Language Models
Julia Shuieh, Prasann Singhal, Apaar Shanker, John Heyer, George Pu, Samuel Denton

TL;DR
This paper evaluates how different post-training methods for language models handle spurious correlations, revealing that robustness varies by task type and correlation nature, with no single method being universally best.
Contribution
The paper systematically compares three post-training algorithms (SFT, DPO, and KTO) across diverse synthetic tasks and spuriousness conditions, offering insight into where each method is robust and where it breaks down.
Findings
Preference-based methods (DPO and KTO) show relative robustness on mathematical reasoning tasks.
Supervised Fine-Tuning (SFT) performs better on complex, context-rich tasks.
Model performance generally degrades as the degree of spurious correlation increases, but the size of the drop varies by method.
Abstract
Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations -- arising from biases, dataset artifacts, or other "shortcut" features -- that can compromise a model's performance or generalization. In this paper, we systematically evaluate three post-training algorithms -- Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO) -- across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: "Feature Ambiguity" and "Distributional Narrowness." Our…
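To make the "degree of spurious correlation (10% vs. 90%)" concrete, here is a minimal Python sketch of one way such a condition could be constructed. It is an illustration only: the excerpt above does not describe the paper's actual data pipeline, and the function names, the `" [NOTE]"` marker, and the `prompt`/`label` record layout are all hypothetical. The idea is to attach a task-irrelevant marker to a chosen fraction of positive-label training examples, so that a model could learn the marker as a shortcut.

```python
import random


def inject_spurious_feature(examples, rate, marker=" [NOTE]", seed=0):
    """Append `marker` to the prompt of a fraction `rate` of label-1 examples.

    Hypothetical helper: each example is assumed to be a dict with a
    "prompt" string and a binary "label". The marker carries no task
    information, so any reliance on it is a learned shortcut.
    """
    rng = random.Random(seed)
    positives = [i for i, ex in enumerate(examples) if ex["label"] == 1]
    n_marked = round(rate * len(positives))
    marked = set(rng.sample(positives, n_marked))

    out = []
    for i, ex in enumerate(examples):
        is_marked = i in marked
        out.append({
            "prompt": ex["prompt"] + marker if is_marked else ex["prompt"],
            "label": ex["label"],
            "spurious": is_marked,  # bookkeeping flag, not a model input
        })
    return out


def marker_rate(examples):
    """Fraction of label-1 examples that carry the spurious marker."""
    positives = [ex for ex in examples if ex["label"] == 1]
    if not positives:
        return 0.0
    return sum(ex["spurious"] for ex in positives) / len(positives)


if __name__ == "__main__":
    # Toy data: 100 examples, half of them with label 1.
    data = [{"prompt": f"question {i}", "label": i % 2} for i in range(100)]
    for rate in (0.1, 0.9):
        train = inject_spurious_feature(data, rate)
        print(rate, marker_rate(train))
```

Under this assumed setup, the same training set can be rendered at a low (10%) and a high (90%) spuriousness level, and each post-training method can then be evaluated on held-out examples where the marker is absent or uninformative.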
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Shrink and Fine-Tune · Sparse Evolutionary Training
