Toward Human-AI Complementarity Across Diverse Tasks

Yuzheng Xu; Annya Dahmani; Matthew D. Blanchard; Niclas Dern; Edy Nastase; Francesca Bianco; Maja Pavlovic; Sukanya Krishna; Eric Modesitt; Miranda Anna Christ; Arth Singh; Gaia Molinaro; Sikata Bela Sengupta; Jaji Pamarthi; Arjun Menon; Rishub Jain

arXiv:2605.04070·cs.HC·May 7, 2026

TL;DR

This paper evaluates human-AI collaboration methods across diverse tasks, revealing modest gains and highlighting the importance of decision routing and assistance design for effective oversight.

Contribution

It provides a comprehensive analysis of human-AI complementarity on realistic tasks, identifying key bottlenecks and offering concrete directions for improving collaboration methods.

Findings

01  Hybridization yields only +0.4 pp improvement over AI alone.

02  Top-2 assistance increases human accuracy from 28.4% to 38.3%.

03  Confidence-based routing struggles to identify complementarity regions.

Abstract

Human-AI complementarity, the idea that combining human and AI judgments can outperform either alone, offers a promising pathway toward robust oversight of advanced AI systems. However, whether human-AI complementarity can be achieved on realistic tasks remains an open question. We investigate this through two approaches: hybridization and two AI assistance methods (top-2 assistance and subtask delegation), evaluated on a multi-domain dataset of 1,886 samples spanning knowledge, factuality, long-context reasoning, and deception detection. We find only modest complementarity gains. Baseline hybridization yields just +0.4 percentage points (pp) over AI alone (69.3% vs 68.9%), limited by both a small complementarity region (only 8.9% of items where AI errs but humans do not) and the inability of confidence-based routing to identify it, since the model's confidence is similarly…
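The two quantities the abstract leans on, the complementarity region and the accuracy of a confidence-routed hybrid, can be illustrated with a minimal sketch. It assumes a simple threshold rule (send an item to the human when the AI's confidence falls below a cutoff); the excerpt does not specify the paper's actual routing rule, and the `Item` record, function names, and toy data below are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One evaluated sample (hypothetical record layout)."""
    ai_correct: bool
    human_correct: bool
    ai_confidence: float  # assumed to lie in [0, 1]


def complementarity_region(items):
    """Fraction of items where the AI errs but the human does not."""
    hits = sum((not it.ai_correct) and it.human_correct for it in items)
    return hits / len(items)


def hybrid_accuracy(items, threshold):
    """Accuracy when items with AI confidence below `threshold` go to the human.

    This threshold rule is an assumption made for illustration.
    """
    correct = 0
    for it in items:
        use_human = it.ai_confidence < threshold
        correct += it.human_correct if use_human else it.ai_correct
    return correct / len(items)


def oracle_accuracy(items):
    """Upper bound: a perfect router always picks whichever party is right."""
    return sum(it.ai_correct or it.human_correct for it in items) / len(items)


# Toy data (not from the paper): the only item in the complementarity
# region carries high AI confidence, so thresholding never reaches it
# without also handing over items the AI got right.
items = [
    Item(ai_correct=True, human_correct=True, ai_confidence=0.90),
    Item(ai_correct=True, human_correct=False, ai_confidence=0.80),
    Item(ai_correct=False, human_correct=True, ai_confidence=0.85),
    Item(ai_correct=False, human_correct=False, ai_confidence=0.30),
]

ai_alone = hybrid_accuracy(items, threshold=0.0)   # nothing routed to the human
routed = hybrid_accuracy(items, threshold=0.5)     # only the 0.30 item is routed
region = complementarity_region(items)
oracle = oracle_accuracy(items)
```

In this toy set the routed hybrid matches AI-alone accuracy even though an oracle router would do better, which mirrors the bottleneck the abstract describes: when confidence does not separate the AI's errors from its correct answers, a threshold cannot isolate the complementarity region.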
