TL;DR
This paper identifies inconsistency issues in CATE estimation algorithms across group assignments, introduces a metric to measure it, and proposes CLAGA to improve estimation accuracy, validated by experiments.
Contribution
It introduces CLAGA, a novel method to ensure consistent labeling across group assignments in CATE estimation, reducing variance and improving performance.
Findings
Inconsistency in CATE algorithms increases test error.
CLAGA effectively eliminates labeling inconsistency.
Experiments show improved accuracy with CLAGA on various datasets.
Abstract
Numerous algorithms have been developed for Conditional Average Treatment Effect (CATE) estimation. In this paper, we first highlight a common issue where many algorithms exhibit inconsistent learning behavior for the same instance across different group assignments. We introduce a metric to quantify and visualize this inconsistency. Next, we present a theoretical analysis showing that this inconsistency indeed contributes to higher test errors and cannot be resolved through conventional machine learning techniques. To address this problem, we propose a general method called Consistent Labeling Across Group Assignments (CLAGA), which eliminates the inconsistency and is applicable to any existing CATE estimation algorithm. Experiments on both synthetic and real-world datasets demonstrate significant performance improvements with CLAGA.
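The abstract does not define the paper's inconsistency metric or how CLAGA works. As a rough illustration of the underlying idea only, here is a minimal sketch under stated assumptions. It uses a T-learner with linear outcome models (a swapped-in choice, not the paper's setup) and keeps the covariates and outcomes fixed. It then repeatedly draws a fresh balanced random treatment assignment and reports the average per-instance standard deviation of the estimated CATE. The function names `fit_linear`, `t_learner_cate` and `assignment_inconsistency` are hypothetical, and the score is not the paper's metric. The simulated data has no treatment effect, so the true CATE is zero everywhere. Any spread in the estimate for a given instance therefore comes only from the random group assignment.

```python
import numpy as np


def fit_linear(x, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, x):
    """Evaluate the fitted linear model (intercept first) on covariates x."""
    return np.column_stack([np.ones(len(x)), x]) @ coef


def t_learner_cate(x, y, w):
    """T-learner: fit separate outcome models on treated (w == 1) and control
    (w == 0) units, then take the difference of their predictions on all x."""
    mu1 = fit_linear(x[w == 1], y[w == 1])
    mu0 = fit_linear(x[w == 0], y[w == 0])
    return predict_linear(mu1, x) - predict_linear(mu0, x)


def assignment_inconsistency(x, y, n_repeats=50, seed=0):
    """Hypothetical score: mean over instances of the standard deviation of the
    CATE estimate across independent balanced re-assignments of the same (x, y)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = []
    for _ in range(n_repeats):
        # Completely randomized design: exactly half of the units are treated.
        w = rng.permutation(np.arange(n) % 2)
        estimates.append(t_learner_cate(x, y, w))
    estimates = np.stack(estimates)  # shape (n_repeats, n)
    return estimates.std(axis=0).mean(), estimates


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(200, 3))
    # Outcomes do not depend on treatment, so the true CATE is 0 for every unit.
    y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
    score, _ = assignment_inconsistency(x, y)
    print(f"mean per-instance std of the CATE estimate: {score:.3f}")
```

With noisy outcomes the score is typically positive even though the true effect is zero. A correction in the spirit of CLAGA would aim to reduce this spread, but the mechanism the paper actually uses is not described in this excerpt.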
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Originality: I like the idea and I think the decomposition and the extra layer of thinking on estimating CATE with PEHE as loss is quite original. Quality: The paper has good theoretical justification, comprehensive experiments and clear algorithm descriptions. Significance: I believe this consistency among group property is quite important in some applications.
I like the paper but I think the main weakness is the clarity of the presentation and the experiments. Details below:
Clarity:
1. line 191: I found the notation quite confusing. $\hat{\tau}(x)$ is usually a deterministic quantity with lower case $x$.
2. Similar for line 237, the notation is quite confusing to me.
3. Line 323: this is quite confusing, what is well-trained and what is well-designed.
4. line 372 to 3745: still quite confusing. In particular I don't see how these different $\ta
The paper provides simulation and real-world studies.
1. The paper's theoretical foundation is excessively simple, with key arguments' validity contingent upon vague definitions and assumptions. If the authors claim "theoretical analysis", they must rigorously align with established literature in the field. For instance, the paper's use of "consistency" appears misapplied.
2. The manuscript seems preoccupied with the issue of unbiased estimators. It is overlooked that even biased estimators may converge to the true parameter as sample size increas
Its merits aside, the proposed metric seems to be novel.
I don't understand what line 191 means. If you condition on $X=x$, then the value of $W$ doesn't matter. So I understand line 191 to mean that you're taking an expectation with respect to the training data, and only the training data. The proposed discrepancy measure seems to be something about the overfitting of the estimators under a strong null hypothesis of both (i) treatment randomized independently of covariates and (ii) no distributional treatment effect. I don't understand why this setting wo
- The authors clearly demonstrate a problem in CATE estimation.
- The authors provide theoretical insight into the estimation error via decomposition, which may be a useful tool for discussion and analysis.
- The authors present a new framework intended to address the identified inconsistency in CATE estimation.
- By developing their own metric on which they showcase a problem and their own solution, the authors create their own niche with no immediate comparison work available. Nevertheless, other works concerned with the overall performance of CATE estimators exist and should be included. For example, algorithmic fairness research may be directly relevant here.
- The presented issue and consequent solution are motivated by Figure 1, which is correctly captioned as showcasing overfitting. Method for tackling overfitt
