Evaluating multiple models using labeled and unlabeled data

Divya Shanmugam; Shuvom Sadhuka; Manish Raghavan; John Guttag; Bonnie Berger; Emma Pierson

arXiv:2501.11866 · cs.LG · October 15, 2025

TL;DR

This paper introduces SSME, a semi-supervised evaluation method that leverages both labeled and unlabeled data to accurately assess classifiers, especially in domains with limited labeled datasets.

Contribution

The paper presents SSME, the first evaluation approach to utilize unlabeled data and multiple classifiers simultaneously for more accurate performance estimation.

Findings

01  SSME reduces evaluation error by 5.1x compared to using labeled data alone.

02  SSME outperforms existing methods in diverse domains like healthcare and image annotation.

03  SSME also improves performance estimates across subgroups and for language models.

Abstract

It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground truth labels…
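
The key idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a binary task, treats each classifier's score as a real-valued logit, and swaps in a two-component Gaussian mixture as the density model for simplicity. The function names (`log_gauss`, `fit_posteriors`) and the synthetic data are hypothetical. Labeled points have their component membership clamped to the known label, unlabeled points get soft memberships from EM, and the fitted posteriors are then used to estimate each classifier's accuracy.

```python
import numpy as np

def log_gauss(X, mean, cov):
    """Log-density of a multivariate Gaussian at each row of X."""
    d = X.shape[1]
    diff = X - mean
    sol = np.linalg.solve(cov, diff.T).T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.sum(diff * sol, axis=1) + logdet + d * np.log(2 * np.pi))

def fit_posteriors(S_lab, y_lab, S_unl, n_iter=100, reg=1e-3):
    """Semi-supervised EM; returns P(y=1 | scores) for labeled then unlabeled rows.

    Assumption of this sketch: Gaussian class-conditional score densities.
    """
    S = np.vstack([S_lab, S_unl])
    n_l, d = S_lab.shape
    # Responsibilities: labeled rows are clamped to their label; unlabeled rows
    # start at zero so the first M-step uses labeled data only.
    R = np.zeros((len(S), 2))
    R[:n_l] = np.eye(2)[y_lab]
    for _ in range(n_iter):
        # M-step: class weights, means, and covariances from responsibilities.
        weights = R.sum(axis=0) / R.sum()
        params = []
        for k in range(2):
            mean = R[:, k] @ S / R[:, k].sum()
            diff = S - mean
            cov = (R[:, k, None] * diff).T @ diff / R[:, k].sum() + reg * np.eye(d)
            params.append((mean, cov))
        # E-step: update responsibilities for unlabeled rows only.
        log_p = np.column_stack(
            [np.log(weights[k]) + log_gauss(S, *params[k]) for k in range(2)]
        )
        log_p -= log_p.max(axis=1, keepdims=True)
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
        R[n_l:] = post[n_l:]
    return R[:, 1]

# Synthetic demo (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
y_lab = np.array([0] * 12 + [1] * 8)           # 20 labeled points
y_unl = (rng.random(2000) < 0.4).astype(int)   # labels hidden from the estimator
y_all = np.concatenate([y_lab, y_unl])
signal = np.array([1.5, 1.0, 0.5])             # three classifiers of varying quality
S_all = (2 * y_all[:, None] - 1) * signal + rng.normal(size=(len(y_all), 3))

p1 = fit_posteriors(S_all[:20], y_lab, S_all[20:])
preds = S_all > 0                               # threshold each logit at 0
# Expected accuracy under the fitted model: probability the prediction matches y.
est_acc = np.where(preds, p1[:, None], 1 - p1[:, None]).mean(axis=0)
true_acc = (preds == y_all[:, None].astype(bool)).mean(axis=0)
print("estimated:", est_acc.round(3), "true:", true_acc.round(3))
```

The same posteriors could be plugged into other metrics that are functions of scores and labels; accuracy is used here only because it is the simplest to write down.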

Videos

Evaluating multiple models using labeled and unlabeled data · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)