Domain Generalization via Pareto Optimal Gradient Matching

Khoi Do; Duong Nguyen; Nam-Khanh Le; Quoc-Viet Pham; Binh-Son Hua; Won-Joo Hwang

arXiv:2507.14227·cs.LG·July 22, 2025

3 Reviews

TL;DR

This paper introduces POGM, a novel gradient matching method for domain generalization that balances domain knowledge integration and computational efficiency, outperforming existing approaches.

Contribution

We propose POGM, a Pareto optimal gradient matching technique that uses gradient trajectories and meta-learning to improve domain generalization.

Findings

01. Competitive performance on DomainBed datasets

02. Enhanced computational efficiency

03. Effective gradient trajectory utilization

Abstract

In this study, we address the gradient-based domain generalization problem, where predictors aim for consistent gradient directions across different domains. Existing methods have two main challenges. First, minimization of gradient empirical distance or gradient inner products (GIP) leads to gradient fluctuations among domains, thereby hindering straightforward learning. Second, the direct application of gradient learning to the joint loss function can incur high computation overheads due to second-order derivative approximation. To tackle these challenges, we propose a new Pareto Optimality Gradient Matching (POGM) method. In contrast to existing methods that add gradient matching as regularization, we leverage gradient trajectories as collected data and apply independent training at the meta-learner. In the meta-update, we maximize GIP while limiting the learned gradient from…
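The gradient inner product (GIP) mentioned in the abstract can be illustrated with a small sketch. This is not the POGM method itself; it only shows the quantity that gradient-matching approaches try to make large. The linear model, the squared-error loss, the two toy domains, and the helper names `mse_grad`, `dot`, and `mean_pairwise_gip` are all assumptions made for illustration.

```python
from itertools import combinations


def mse_grad(w, data):
    """Gradient of the mean squared error of a linear model y ~ w . x."""
    n = len(data)
    grad = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / n
    return grad


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_pairwise_gip(grads):
    """Average inner product over all pairs of per-domain gradients."""
    pairs = list(combinations(grads, 2))
    return sum(dot(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy domains: lists of (features, target) pairs.
domains = {
    "A": [((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0)],
    "B": [((1.0, 1.0), 2.0), ((2.0, 0.0), 2.0)],
}

w = [0.0, 0.0]  # shared model parameters
grads = [mse_grad(w, d) for d in domains.values()]
gip = mean_pairwise_gip(grads)

# A positive GIP means the domains agree on the descent direction at w;
# a negative value would signal conflicting gradients.
print("per-domain gradients:", grads)
print("mean pairwise GIP:", gip)
```

A positive value here indicates that a step which lowers the loss on one domain also tends to lower it on the other, which is the consistency property that gradient-based domain generalization aims for.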

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The idea of effectively constraining or regularising the DG-targeted GIP by ERM seems novel and interesting, and the results appear to show that things improve (modulo weaknesses discussed below). Moreover, the meta-learning setup seems to additionally allow for efficiencies in terms of choosing when to optimise the meta learner itself, rather than all the time. The experiments seem quite thorough also with well-standardised settings and hyperparameter search.

Weaknesses

The main weaknesses with this work are the following: a. The discussion of the shortcomings of Fish and Fishr is quite difficult to make sense of. For a piece of work that relies on these for motivating a fix, this should be done much better than it currently is, where it seems quite rushed and disjointed. 1. Figure 1 has a lot of stuff in it that isn't explained in the captions. What is the reader to make of the arrows and trajectory of the optimisation in (1a) with regard to gradi

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. The proposed POGM method is well-motivated, with sufficient theoretical and empirical analysis on existing gradient-based DG methods. 2. The paper is well-written and the methodology is clearly explained. 3. The proposed POGM method is evaluated on several benchmark DG datasets.

Weaknesses

1. Missing comparison with other baselines based on training-domain selection method. As mentioned in DomainBed (Gulrajani & Lopez-Paz, 2021), training-domain selection or leave-one-out method is better suited for the DG setting. 2. As shown in Table 1, POGM achieves only marginal improvements compared to other baseline methods. Specifically, it achieves the best performance on only 3 out of the 7 datasets. 3. The base feature extractor used in POGM differs for the VLCS, PACS, and OfficeHome da

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. In the experiments, the proposed method achieves surprisingly good performance compared with previous methods.

Weaknesses

1. The paper is poorly written and hard to follow, especially in Sec. 4 where it is hard to understand the logic of method design. 2. The Pareto optimality is employed only for the worst-case matching gradient, which is actually unnecessary. Maybe the combination of $\pi$ is suitable to apply Pareto optimality. 3. Previous related methods that adopt Pareto Optimality [1,2], and training trajectory [3] should be discussed and compared. 4. Why employ the trajectory matching and how it works is no
