Robust Domain Generalization under Divergent Marginal and Conditional Distributions

Jewon Yeom; Kyubyung Chae; Hyunggyu Lim; Yoonna Oh; Dongyoon Yang; Taesup Kim

arXiv:2602.02015·cs.LG·February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a unified framework for robust domain generalization that handles simultaneous divergence in both marginal and conditional distributions, improving generalization to unseen domains.

Contribution

The paper derives a risk bound for unseen domains that accounts for shifts in both the marginal label distribution and the conditional distribution, and pairs it with a meta-learning approach designed to generalize across diverse domains.

Findings

01. Achieves state-of-the-art results on standard DG benchmarks.

02. Performs well in multi-domain long-tailed recognition scenarios.

03. Effectively handles complex distribution shifts in real-world data.

Abstract

Domain generalization (DG) aims to learn predictive models that can generalize to unseen domains. Most existing DG approaches focus on learning domain-invariant representations under the assumption of conditional distribution shift (i.e., primarily addressing changes in $P(X \mid Y)$ while assuming $P(Y)$ remains stable). However, real-world scenarios with multiple domains often involve compound distribution shifts where both the marginal label distribution $P(Y)$ and the conditional distribution $P(X \mid Y)$ vary simultaneously. To address this, we propose a unified framework for robust domain generalization under divergent marginal and conditional distributions. We derive a novel risk bound for unseen domains by explicitly decomposing the joint distribution into marginal and conditional components and characterizing risk gaps arising from both sources of divergence. To operationalize…
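The split of the joint distribution into marginal and conditional parts, as described in the abstract, can be illustrated with an elementary identity. Writing the risk as $R(h) = \sum_y P(y)\, \mathbb{E}_{x \sim P(x \mid y)}[\ell(h(x), y)]$, the gap between a target and a source domain separates exactly into a label-marginal term and a class-conditional term. The sketch below checks this numerically on a toy three-class problem. It is not the paper's risk bound: the numbers, the variable names, and the `risk` helper are all hypothetical and chosen only for illustration.

```python
# Toy check of an exact risk-gap identity (illustration only, NOT the paper's bound):
#   R_T - R_S = sum_y (P_T(y) - P_S(y)) * r_T(y)      <- label-marginal (prior) shift
#             + sum_y P_S(y) * (r_T(y) - r_S(y))      <- class-conditional shift
# where r_D(y) is the expected loss on class y under domain D's P(x | y).

def risk(prior, class_loss):
    """Expected loss: sum over classes of P(y) * E[loss | y]."""
    return sum(p * l for p, l in zip(prior, class_loss))

# Hypothetical values for a 3-class problem (assumed, not taken from the paper).
p_src = [0.6, 0.3, 0.1]        # source label marginal P_S(Y)
p_tgt = [0.2, 0.3, 0.5]        # target label marginal P_T(Y)
loss_src = [0.10, 0.20, 0.40]  # per-class expected loss under P_S(X | Y)
loss_tgt = [0.15, 0.25, 0.60]  # per-class expected loss under P_T(X | Y)

gap = risk(p_tgt, loss_tgt) - risk(p_src, loss_src)

# Contribution from the change in P(Y), evaluated at the target per-class losses.
prior_term = sum((pt - ps) * lt for pt, ps, lt in zip(p_tgt, p_src, loss_tgt))

# Contribution from the change in P(X | Y), weighted by the source label marginal.
cond_term = sum(ps * (lt - ls) for ps, lt, ls in zip(p_src, loss_tgt, loss_src))

print(f"risk gap        = {gap:.3f}")
print(f"prior term      = {prior_term:.3f}")
print(f"conditional term = {cond_term:.3f}")
```

In this toy setting most of the gap comes from the label-marginal term, which is the kind of contribution a method that assumes a stable $P(Y)$ would not account for.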

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper provides a clean theoretical decomposition of domain generalization risk into interpretable components (prior shift and feature shift).
2. Good performance on standard DG and MDLT benchmarks.

Weaknesses

1. While the decomposition is useful, the individual components (domain alignment, meta-learning for DG) are well-established. The main contribution is combining them with theoretical justification, but the theoretical tools (Wasserstein distance bounds, InfoNCE decomposition) are standard.
2. The definition of $\pi$ in Theorem 1 is missing.
3. Although the theory motivates minimizing Wasserstein feature distance, the implemented DA loss is a heuristic contrastive loss that aligns features wit

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper introduces a principled and interpretable risk decomposition that explicitly separates the effects of prior and feature distribution shifts. The authors empirically verify the correlation between the DA loss and the generalisation gap, and conduct ablations showing the complementary effects of the DA loss, meta-learning, and Manifold Mixup.

Weaknesses

1. Generalization under concurrent marginal and conditional distribution shifts has been extensively studied in prior works (e.g., Hu et al., 2020; Tan et al., 2024), suggesting that this might not be a critical gap. However, this does not detract from the systematic and insightful theoretical framework presented by the authors.
2. A primary theoretical concern is that the PL condition represents a strong assumption regarding the non-convex loss landscapes inherent to deep neural networks (as

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* This paper conducts a solid theoretical analysis of the domain generalization error. The decomposition into prior and feature shift terms is intuitive and insightful. The algorithm design is also closely connected with these theoretical insights.
* Experimental results show decent performance improvement on both standard DG and MDLT benchmarks.

Weaknesses

* From my current understanding, the current theoretical framework cannot guarantee generalization on an arbitrary target domain. In Theorem 3, the performance depends on the Wasserstein distance between the target data distribution and its best approximation via interpolation between source domains. Hence, it only guarantees generalization under the condition that the target domain is an interpolation of source domains, which is known to be well-resolved by ERM. The cases where the target domai

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Face recognition and analysis · Topic Modeling