Optimizing importance weighting in the presence of sub-population shifts
Floris Holstege, Bram Wouters, Noud van Giersbergen, Cees Diks

TL;DR
This paper introduces a bi-level optimization approach for importance weighting to better handle sub-population shifts, improving model generalization by balancing bias and variance during training.
Contribution
It proposes a novel bi-level optimization method for importance weights that accounts for variance, enhancing robustness to distribution shifts in deep learning models.
Findings
Optimizing importance weights improves test performance.
The method accounts for the extra variance of the estimated model that importance weighting introduces at finite training-sample sizes.
Empirical results show significant generalization gains.
Abstract
A distribution shift between the training and test data can severely harm performance of machine learning models. Importance weighting addresses this issue by assigning different weights to data points during training. We argue that existing heuristics for determining the weights are suboptimal, as they neglect the increase of the variance of the estimated model due to the finite sample size of the training data. We interpret the optimal weights in terms of a bias-variance trade-off, and propose a bi-level optimization procedure in which the weights and model parameters are optimized simultaneously. We apply this optimization to existing importance weighting techniques for last-layer retraining of deep neural networks in the presence of sub-population shifts and show empirically that optimizing weights significantly improves generalization performance.
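The bi-level idea in the abstract can be illustrated with a toy sketch. It is not the authors' algorithm: it assumes a weighted ridge regression as a stand-in for last-layer retraining, two hypothetical groups with made-up coefficients (`BETAS`), known test group proportions (`p_test`), and a plain grid search over the minority-group weight for the outer problem. The inner problem fits the model for given weights; the outer problem picks the weights that minimise a held-out validation loss reweighted to the test group proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-group regression coefficients (illustration only).
BETAS = np.array([[1.0, 0.0], [0.0, 1.0]])


def make_data(n_major, n_minor):
    """Simulate two sub-populations with different input-output relations."""
    g = np.repeat([0, 1], [n_major, n_minor])
    X = rng.normal(size=(g.size, 2))
    y = np.sum(X * BETAS[g], axis=1) + rng.normal(scale=0.5, size=g.size)
    return X, y, g


def fit(X, y, w, lam=1e-3):
    """Inner problem: closed-form weighted ridge regression."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw + lam * np.eye(X.shape[1]), Xw.T @ y)


def outer_loss(p, train, val, p_test):
    """Outer objective: validation MSE reweighted to test group proportions."""
    X_tr, y_tr, g_tr = train
    X_val, y_val, g_val = val
    theta = fit(X_tr, y_tr, p[g_tr])  # each sample gets its group's weight
    p_val = np.bincount(g_val, minlength=p.size) / g_val.size
    w_val = (p_test / p_val)[g_val]
    return np.average((y_val - X_val @ theta) ** 2, weights=w_val)


train = make_data(450, 50)  # 90% / 10% group split in training
val = make_data(90, 10)     # validation drawn with the same proportions
p_train = np.array([0.9, 0.1])
p_test = np.array([0.5, 0.5])

# Heuristic weights p_te(g) / p_tr(g), expressed as a minority/majority ratio.
heuristic = p_test / p_train
heuristic_ratio = heuristic[1] / heuristic[0]

# Outer problem by grid search; the heuristic ratio is appended as the last
# candidate so the comparison below is like-for-like.
ratios = np.append(np.linspace(1.0, 15.0, 29), heuristic_ratio)
losses = np.array(
    [outer_loss(np.array([1.0, r]), train, val, p_test) for r in ratios]
)

best_ratio = ratios[int(np.argmin(losses))]
best_loss = float(losses.min())
heuristic_loss = float(losses[-1])

print(f"heuristic ratio {heuristic_ratio:.1f}: validation loss {heuristic_loss:.4f}")
print(f"selected ratio  {best_ratio:.1f}: validation loss {best_loss:.4f}")
```

Because the heuristic ratio is among the candidates, the selected weights can never do worse than the heuristic on the validation objective. Whether they do strictly better, and whether that carries over to test data, depends on the sample sizes and noise level. At scale, a gradient-based outer update would be the more natural choice than a grid search.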
Peer Reviews
Decision·ICLR 2025 Poster
1. The bi-level optimization of importance weights is novel, addressing the bias-variance trade-off more effectively than heuristic-based weighting.
2. The analysis using a linear regression model provides clear insights into the bias-variance trade-off, making the proposed method more convincing.
3. The experiment section is comprehensive, covering multiple datasets.
4. The proposed method is useful in practice.
1. The computational overhead of the proposed method should be discussed. Since only the last layer needs to be trained, is the overhead small?
2. The proposed method needs to split the training dataset into a validation set and a remaining training set. What is the trade-off here? What is the best way to split? If the training dataset is limited, will the proposed method still work?
The paper is well-written and systematically defines all terms and equations with attention to detail. It is mathematically well-supported and contains relevant experiments on standard benchmarks.
Refer to Questions
- The argument that "the conventional importance weighting $p_{te}(x,y)/ p_{tr}(x,y)$ could be sub-optimal" is interesting.
- This paper estimates model parameters and importance weights iteratively on the training and i.i.d. validation datasets. That is interesting, because it reduces overfitting to the training dataset. Unlike conventional cross-validation and hyper-parameter search, this approach finds the importance weights by optimization.
- Experiments show good performance of the proposed method. However, Algorithm 1 (line 278 $p_0$ and line 282 $r$) requires access to the test dataset for parameter initialization. This could be problematic. I understand that some explicit reweighting methods, such as JTT, also contain a heuristic, data-dependent search space for importance weights. Such heuristic hyperparameter search spaces could also depend on the test dataset. However, group distributional robust optimization (GDRO)
Taxonomy
Topics: Insurance, Mortality, Demography, Risk Management
