TL;DR
This paper introduces a scalable oversight framework that leverages weak complementary labels from humans with narrow expertise to evaluate and improve advanced AI systems without ground truth, enabling efficient supervision.
Contribution
It proposes unbiased estimators for AI evaluation using complementary labels, combining scarce ordinary labels with abundant weak signals, and demonstrates self-improving AI with partitioned human supervision.
Findings
Unbiased estimator of top-1 accuracy from complementary labels
Quantification of complementary labels needed to match variance of ordinary labels
Empirical validation of evaluation and training of large language models without ground truth
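The first two findings can be illustrated with a minimal sketch. It assumes a K-option multiple-choice task in which each complementary label is drawn uniformly from the K-1 incorrect options, independently of the model's prediction; the paper's exact query mechanism may differ. Under that assumption, the model's answer coincides with the complementary label with probability (1 - A)/(K - 1), where A is the top-1 accuracy, so 1 - (K - 1) times the observed match rate is an unbiased estimate of A. Its per-sample variance is (1 - A)(K - 2 + A), against A(1 - A) for an ordinary label, which gives a ratio of (K - 2 + A)/A complementary labels per ordinary label. The function names below are hypothetical and are not taken from the paper.

```python
import random


def complementary_accuracy(preds, comp_labels, num_options):
    """Estimate top-1 accuracy from complementary labels.

    Assumption of this sketch: each complementary label is drawn uniformly
    from the num_options - 1 incorrect options.
    """
    if len(preds) != len(comp_labels) or not preds:
        raise ValueError("need equally sized, non-empty inputs")
    # Count how often the prediction equals an option known to be wrong.
    hits = sum(p == c for p, c in zip(preds, comp_labels))
    return 1.0 - (num_options - 1) * hits / len(preds)


def labels_needed_ratio(accuracy, num_options):
    """Complementary labels per ordinary label for equal estimator variance.

    Derived as (1 - A)(K - 2 + A) / (A (1 - A)) under the uniform assumption.
    """
    return (num_options - 2 + accuracy) / accuracy


def simulate(n, num_options, true_acc, seed=0):
    """Generate synthetic predictions and uniform complementary labels."""
    rng = random.Random(seed)
    preds, comps = [], []
    for _ in range(n):
        truth = rng.randrange(num_options)
        wrong = [k for k in range(num_options) if k != truth]
        # The toy model is right with probability true_acc, else picks a wrong option.
        preds.append(truth if rng.random() < true_acc else rng.choice(wrong))
        # The annotator rules out one incorrect option, chosen uniformly.
        comps.append(rng.choice(wrong))
    return preds, comps


if __name__ == "__main__":
    preds, comps = simulate(n=50000, num_options=4, true_acc=0.7)
    print("estimated accuracy:", complementary_accuracy(preds, comps, 4))
    print("complementary labels per ordinary label:", labels_needed_ratio(0.7, 4))
```

Two properties follow directly from the formulas: with two options the ratio is 1, since ruling out one option reveals the answer, and on small samples the estimate can fall below 0 because it is unbiased rather than clipped.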
Abstract
As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging. Our focus is on tasks that require deep knowledge and skills of multiple domains, where this bottleneck is severe. Unfortunately, even the best human experts are knowledgeable only in a single narrow area, and will not be able to evaluate the correctness of advanced AI systems on such superhuman tasks. However, based on their narrow expertise, humans may provide a weak signal, i.e., a complementary label indicating an option that is incorrect. For example, a cardiologist could state that "this is not related to any cardiovascular disease," even if they cannot identify the true disease. Based on this weak signal, we propose a scalable oversight framework that enables us…
Peer Reviews
Decision: ICLR 2026 Poster
1. I think the theoretical framing of complementary labels as sufficient to estimate model performance is clean and clearly motivated. I like that the paper focuses on the oversight bottleneck, which is timely. 2. The derivation of the estimator and the variance analysis appear correct to me. The decomposition and reasoning about how the uniform query mechanism leads to unbiased estimation is neatly done. I also found the combination with ordinary labels to reduce variance to be intuitive. 3. The e
1. While I follow the theoretical contribution, I am less convinced about the real-world applicability for the kind of motivation the paper has. The experiments are exclusively on closed-form multiple-choice tasks. However, the motivating context in the paper is open-ended and superhuman domains. In that sense, there is a bit of a mismatch: the paper does not really demonstrate that the method scales to the intended setting. 2. I also think the paper assumes too much familiarity with the labelin
- This work studies a very relevant topic that is fundamental for the enhancement of current AI systems, especially for their training and evaluation w.r.t. humans. - The paper provides mathematical details, motivation and theoretical analysis of its proposed framework. In this way, the contributions are concretely based on mathematical principles and characterization. - The result demonstrating that weak human feedback can provide useful learning signal is very promising.
- Top-1 accuracy can be scarce or too limited to evaluate the performance of a foundation model. - The presentation is a bit difficult to follow as it is very technical. However, I don't consider this a proper weakness. - If I understand correctly, the number of complementary labels needed for satisfactory training can be very large. - I didn't understand if it is a typo, but several paragraph titles are coloured, which is quite uncommon.
This work regarding how we can leverage human data in nontraditional ways via complementary labels is original and timely to the current dearth of adequate evaluation paradigms for LLMs. The paper leverages high-quality evaluation paradigms to support their claims, approaching the problem both empirically (across multiple benchmarks, tasks, and domains) and theoretically (unbiased estimator derivation, bounding the variance of the estimators, quantifying how many complementary labels equate to a
One of my biggest qualms with the work is the general motivation framework. “...future models will tackle problems whose solutions are too technical or too crossdisciplinary for any single human to verify comprehensively. When we cannot produce ground truth or prepare automated verifiers, how should we evaluate and train such systems?” In theory there is value to this question and motivation, but how do we ensure that “not cardiology” is an informative label? Specifically, in this toy example, w
