Bias as a Virtue: Rethinking Generalization under Distribution Shifts

Ruixuan Chen; Wentao Li; Jiahui Xiao; Yuchen Li; Yimin Tang; Xiaonan Wang

arXiv:2506.00407·cs.LG·June 3, 2025

Open Access·3 Reviews

TL;DR

This paper proposes that intentionally increasing in-distribution bias during training can improve out-of-distribution generalization, challenging traditional validation methods and introducing a new framework called Adaptive Distribution Bridge.

Contribution

The paper introduces the Adaptive Distribution Bridge framework that leverages controlled statistical diversity to enhance OOD generalization, providing both practical methods and theoretical insights.

Findings

01

Higher in-distribution bias correlates with lower out-of-distribution error.

02

ADB achieves up to 26.8% reduction in mean error compared to traditional methods.

03

High percentile ranks (>74.4%) indicate effective identification of robust training strategies.
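The three quantities reported above (a correlation, a relative error reduction, and a percentile rank) can be illustrated with a minimal sketch. The numbers below are made up, the function names are hypothetical, and the percentile-rank definition (share of candidate strategies with strictly higher OOD error than the chosen one) is an assumption, not necessarily the paper's.

```python
import numpy as np


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def percentile_rank(ood_errors, chosen_index):
    """Percent of candidates with strictly higher OOD error than the chosen one.

    Assumed definition, for illustration only.
    """
    errors = np.asarray(ood_errors, dtype=float)
    return 100.0 * float(np.mean(errors > errors[chosen_index]))


def relative_error_reduction(baseline_error, new_error):
    """Percent reduction of new_error relative to baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Made-up values for five hypothetical training strategies (not from the paper).
id_bias = [0.1, 0.2, 0.3, 0.4, 0.5]
ood_error = [1.0, 0.9, 0.7, 0.6, 0.5]

corr = pearson_correlation(id_bias, ood_error)
chosen = int(np.argmax(id_bias))  # pick the strategy with the highest ID bias
rank = percentile_rank(ood_error, chosen)
reduction = relative_error_reduction(baseline_error=1.0, new_error=0.75)

print(f"correlation={corr:.3f}, percentile rank={rank:.1f}%, reduction={reduction:.1f}%")
```

In this toy data the highest-bias strategy also has the lowest OOD error, so the correlation is negative and the chosen strategy ranks above four of the five candidates.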

Abstract

Machine learning models often degrade when deployed on data distributions different from their training data. Challenging conventional validation paradigms, we demonstrate that higher in-distribution (ID) bias can lead to better out-of-distribution (OOD) generalization. Our Adaptive Distribution Bridge (ADB) framework implements this insight by introducing controlled statistical diversity during training, enabling models to develop bias profiles that effectively generalize across distributions. Empirically, we observe a robust negative correlation where higher ID bias corresponds to lower OOD error--a finding that contradicts standard practices focused on minimizing validation error. Evaluation on multiple datasets shows our approach significantly improves OOD generalization. ADB achieves robust mean error reductions of up to 26.8% compared to traditional cross-validation, and…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

* The observation in this paper is interesting. It offers a novel and counterintuitive core idea: that ID error is negatively correlated with OOD error. If validated, it would represent a significant paradigm shift.
* The ADB framework is described with a precise algorithm and two distinct computational approaches (Cumulative and Batchwise).

Weaknesses

* The theoretical analysis is questionable. The assumption that $b$ is non-negative is not reasonable: under an unknown distribution shift, bias cannot always reduce OOD error. Similarly, simply defining $U(b) = (b-\Delta)^2$ is questionable, since $U(b)$ could equally be $(b+\Delta)^2$.
* Limited empirical evidence to validate the proposed method: with a questionable analysis, the intuition of the proposed framework is similar to previous works that train the model to learn stable features acro…
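The sign concern in the first point can be made concrete with a short worked comparison. It assumes $\Delta > 0$ is the magnitude of the shift and $b$ is the ID bias, which is a reading of the review's notation rather than the paper's stated definitions; the labels $U_-$ and $U_+$ are introduced here only to tell the two forms apart.

$$
U_-(b) = (b-\Delta)^2, \qquad U_-(b) - U_-(0) = b\,(b - 2\Delta) < 0 \iff 0 < b < 2\Delta
$$

$$
U_+(b) = (b+\Delta)^2, \qquad U_+(b) - U_+(0) = b\,(b + 2\Delta) > 0 \quad \text{for all } b > 0
$$

Under the first form a moderate positive bias lowers the error relative to $b = 0$; under the second, any positive bias raises it. Which case applies depends on the direction of the shift, which is unknown in advance.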

Reviewer 02·Rating 4·Confidence 3

Strengths

1. The authors provide a theoretical proof to support the claim that higher ID bias leads to reduced OOD error.
2. The ADB framework is proposed to control distribution shifts.
3. Extensive experiments are conducted to support the findings.

Weaknesses

1. If I understand correctly, the proof in 3.1 assumes a simplified model. Does the conclusion generalize to more complicated settings?
2. What if $\Delta$ is not known? How can the authors determine whether $k < \alpha \Delta$?
3. Computational cost might prohibit applications: "processing all 500 permutation paths required 266.5 total GPU hours with the batchwise approach versus 740 hours with the cumulative approach"
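For scale, the figures quoted in the third point work out as follows. This is plain arithmetic on the quoted numbers; the per-path values assume the 500 paths cost roughly the same each.

$$
\frac{266.5}{740} \approx 0.36, \qquad \frac{266.5}{500} \approx 0.53 \ \text{GPU hours per path (batchwise)}, \qquad \frac{740}{500} = 1.48 \ \text{GPU hours per path (cumulative)}
$$

So the batchwise approach uses about 36% of the cumulative budget, yet still needs more than 260 GPU hours in total.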

Reviewer 03·Rating 4·Confidence 4

Strengths

+ It is interesting to study the connection between ID bias and OOD generalization. The empirical observation that ID bias can lead to better OOD generalization is also interesting. The submission also provides a theoretical study (Section 3) that examines this and discusses how the observation arises (Lines 151-156).
+ ADB introduces controlled statistical diversity during training by modifying data permutations (training order). It uses optimal transport distances (Sinkhorn distance with debiasin…
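The Sinkhorn-based distance mentioned in the second point can be sketched in a few lines of NumPy. This is a minimal illustration of one common debiased form (the cross term minus half of each self term, using the transport cost under the entropic plan); whether ADB uses exactly this formula is not stated here, and the function names, `epsilon`, and `n_iters` are assumptions chosen for the example.

```python
import numpy as np


def sinkhorn_cost(x, y, epsilon=0.5, n_iters=200):
    """Transport cost <P, C> under an entropic-regularized plan P.

    x, y: arrays of shape (n, d) and (m, d), treated as uniform point clouds.
    C is the squared Euclidean cost; P comes from Sinkhorn iterations.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.full(len(x), 1.0 / len(x))  # uniform weights on x
    b = np.full(len(y), 1.0 / len(y))  # uniform weights on y
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())


def debiased_sinkhorn(x, y, epsilon=0.5, n_iters=200):
    """Cross term minus half of each self term, so identical inputs give 0."""
    cross = sinkhorn_cost(x, y, epsilon, n_iters)
    self_x = sinkhorn_cost(x, x, epsilon, n_iters)
    self_y = sinkhorn_cost(y, y, epsilon, n_iters)
    return cross - 0.5 * self_x - 0.5 * self_y


# Toy data (hypothetical): a point cloud and a shifted copy of it.
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 2))
y = x + 1.0

d_same = debiased_sinkhorn(x, x)
d_shift = debiased_sinkhorn(x, y)
print(d_same, d_shift)
```

The self terms remove the blur that entropic regularization adds, which is why a cloud compared with itself scores zero here. For large pairwise distances or small `epsilon`, a log-domain implementation would be needed to avoid underflow in the kernel.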

Weaknesses

- [**Limited Domain Generalization**] The major concern is that ADB is only evaluated on regression-based tabular and molecular datasets. It is unknown how it performs on classification tasks or in vision and language domains. As the claim is broad and general, it would be better to discuss whether the analysis holds for classification and other data types.
- [**Discussion on Latent Representations**] ADB relies heavily on latent representations learned via VAEs. How, then, can one make sure the late…

Taxonomy

Topics: Bayesian Modeling and Causal Inference