Dependable Exploitation of High-Dimensional Unlabeled Data in an Assumption-Lean Framework

Chao Ying; Siyi Deng; Yang Ning; Jiwei Zhao; Heping Zhang

arXiv:2603.27869 · stat.ME · March 31, 2026

TL;DR

This paper develops a semi-supervised inference method for a high-dimensional least-squares regression parameter. The method uses unlabeled data without losing efficiency relative to supervised estimation, even when the conditional mean model is misspecified.

Contribution

The paper introduces a novel estimator that is guaranteed to be at least as efficient as its supervised counterpart, which makes the use of unlabeled data dependable in high-dimensional settings.

Findings

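01. A straightforward debiased estimator is more efficient than its supervised counterpart only if the conditional mean function can be consistently estimated at an appropriate rate; otherwise, incorporating unlabeled data can be counterproductive.

02. The proposed estimator remains at least as efficient as the supervised estimator even when the conditional mean function is misspecified.

03. Simulation studies and real data applications demonstrate the method's effectiveness.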

Abstract

Semi-supervised learning has attracted significant attention due to the proliferation of applications featuring limited labeled data but abundant unlabeled data. In this paper, we examine the statistical inference problem in an assumption-lean framework which involves a high-dimensional regression parameter, defined by minimizing the least squares, within the context of semi-supervised learning. We investigate when and how unlabeled data can enhance the estimation efficiency of a regression parameter functional. First, we demonstrate that a straightforward debiased estimator can only be more efficient than its supervised counterpart if the unknown conditional mean function can be consistently estimated at an appropriate rate. Otherwise, incorporating unlabeled data can actually be counterproductive. To address this vulnerability, we propose a novel estimator guaranteed to be…
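
To make the setting concrete, here is a minimal low-dimensional sketch of how unlabeled covariates can enter a least-squares target. It is an illustration under stated assumptions, not the authors' estimator: the plug-in form, the quadratic working model for the conditional mean, the simulated data, and every name in it (`semi_supervised_ls`, `quad_features`, `draw`) are hypothetical, and it ignores the high-dimensional debiasing and cross-fitting that such a method would need in practice.

```python
import numpy as np

def ols(X, y):
    # Least-squares coefficients (no intercept is added here).
    return np.linalg.lstsq(X, y, rcond=None)[0]

def quad_features(X):
    # Hypothetical working model for E[Y | X]: intercept, linear and squared terms.
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])

def semi_supervised_ls(X_lab, y_lab, X_unlab):
    # Illustrative plug-in estimator of argmin_theta E[(Y - X^T theta)^2].
    n = X_lab.shape[0]
    X_all = np.vstack([X_lab, X_unlab])
    n_all = X_all.shape[0]
    # Step 1: fit the working conditional-mean model on labeled data only
    # (no sample splitting in this sketch).
    gamma = ols(quad_features(X_lab), y_lab)
    m_lab = quad_features(X_lab) @ gamma
    m_all = quad_features(X_all) @ gamma
    # Step 2: estimate E[X X^T] and E[X m(X)] from all covariates, and add
    # a labeled-data residual term for E[X (Y - m(X))].
    gram_all = X_all.T @ X_all / n_all
    moment = X_lab.T @ (y_lab - m_lab) / n + X_all.T @ m_all / n_all
    return np.linalg.solve(gram_all, moment)

# Simulated example: the linear model is misspecified, but the population
# least-squares parameter still equals beta because E[X1^3] = 0.
rng = np.random.default_rng(0)
n, N, d = 500, 5000, 3
beta = np.array([1.0, -0.5, 0.25])

def draw(size):
    X = rng.standard_normal((size, d))
    y = X @ beta + 0.5 * (X[:, 0] ** 2 - 1.0) + rng.standard_normal(size)
    return X, y

X_lab, y_lab = draw(n)
X_unlab, _ = draw(N)          # labels of the unlabeled sample are discarded

theta_sup = ols(X_lab, y_lab)                          # supervised estimate
theta_ss = semi_supervised_ls(X_lab, y_lab, X_unlab)   # semi-supervised estimate
print("supervised:     ", theta_sup)
print("semi-supervised:", theta_ss)
```

In this toy design the working model happens to contain the true conditional mean, which is the favorable case described in the abstract. Replacing `quad_features` with a poorly fitting model is one way to explore the unfavorable case, where a plug-in estimator of this kind carries no guarantee of improving on the supervised fit.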
