Signal or Noise? Evaluating Large Language Models in Resume Screening Across Contextual Variations and Human Expert Benchmarks
Aryan Varshney, Venkat Ram Reddy Ganuthula

TL;DR
This study tests whether three large language models (Claude, GPT, and Gemini) screen resumes consistently across company contexts, and compares their evaluations with those of three human recruitment experts. The LLMs differ significantly from the human experts and vary in how strongly they adapt to company context.
Contribution
It provides a comprehensive analysis of LLMs' performance variability and their divergence from human judgment in automated resume screening tasks.
Findings
Analysis of variance found significant mean differences in four of eight LLM-only conditions.
GPT adapts strongly to company context (p < 0.001), Gemini partially (p = 0.038 for Firm1), and Claude minimally (p > 0.1).
All LLMs differ significantly from human experts across contexts.
Abstract
This study investigates whether large language models (LLMs) exhibit consistent behavior (signal) or random variation (noise) when screening resumes against job descriptions, and how their performance compares to human experts. Using controlled datasets, we tested three LLMs (Claude, GPT, and Gemini) across contexts (No Company, Firm1 [MNC], Firm2 [Startup], Reduced Context) with identical and randomized resumes, benchmarked against three human recruitment experts. Analysis of variance revealed significant mean differences in four of eight LLM-only conditions and consistently significant differences between LLM and human evaluations (p < 0.01). Paired t-tests showed GPT adapts strongly to company context (p < 0.001), Gemini partially (p = 0.038 for Firm1), and Claude minimally (p > 0.1), while all LLMs differed significantly from human experts across contexts. Meta-cognition analysis…
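The paired t-tests mentioned in the abstract compare the same resumes scored under two conditions, for example No Company versus Firm1. The following is a minimal sketch of how such a statistic could be computed, not the authors' procedure. The function name `paired_t_statistic` and the score lists are hypothetical and made up for illustration; they are not data from the study.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on two score lists.

    Each position in scores_a and scores_b must refer to the same resume,
    scored under two different conditions.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    # Per-resume difference between the two conditions.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    # t = mean difference divided by its standard error.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for five resumes (illustration only, not study data).
no_company = [70, 65, 80, 75, 60]
firm1 = [74, 70, 82, 80, 66]

t, df = paired_t_statistic(firm1, no_company)
print(f"t = {t:.3f}, df = {df}")
```

A large absolute t value relative to the t distribution with `df` degrees of freedom corresponds to a small p-value. Converting t to a p-value needs the t distribution's CDF, which the standard library does not provide; a statistics package would normally handle that step.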
