Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
Sinan G. Aksoy, Alexandra A. Sabrio, Erik VonKaenel, Lee Burke

TL;DR
This paper introduces a scalable framework for systematically testing how large language models assess semantic similarity in document pairs, revealing biases and model-specific behaviors.
Contribution
It presents a multifactorial experimental setup to analyze LLM sensitivity to subtle semantic changes, uncovering positional biases and contextual effects across multiple models.
Findings
Most models penalize semantic differences more harshly when they occur earlier in a document.
Topically unrelated context lowers similarity scores and bipolarizes them, pushing scores toward the two extremes.
Models exhibit a universal hierarchy in treating different perturbation types.
Abstract
We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length across all combinations, testing five LLMs on tens of thousands of document pairs. Our analysis reveals several striking findings. First, LLMs exhibit a within-document positional bias distinct from previously studied candidate-order effects: most models penalize semantic differences more harshly when they occur earlier in a document. Second, when the altered sentence is surrounded by topically…
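The multifactorial design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the factor names come from the abstract, but the specific position and length levels, the helper names (`embed_needle`, `build_pair`, `design_grid`), and the sentence-list representation are all assumptions made for the example. The step that sends each pair to an LLM judge is omitted.

```python
from itertools import product

# Factor levels. Perturbation and context types are named in the abstract;
# the position and length values below are hypothetical placeholders.
PERTURBATIONS = ("negation", "conjunction_swap", "named_entity_replacement")
CONTEXTS = ("original", "unrelated")
POSITIONS = (0.0, 0.5, 1.0)  # assumed relative needle positions (start, middle, end)
LENGTHS = (5, 11)            # assumed document lengths, in sentences


def embed_needle(needle, hay, length, rel_pos):
    """Return a list of `length` sentences with `needle` placed at `rel_pos`."""
    if length < 1 or len(hay) < length - 1:
        raise ValueError("not enough hay sentences for the requested length")
    index = round(rel_pos * (length - 1))
    filler = list(hay[: length - 1])
    return filler[:index] + [needle] + filler[index:]


def build_pair(original, altered, hay, length, rel_pos):
    """Build two documents that differ only in the needle sentence."""
    base = embed_needle(original, hay, length, rel_pos)
    perturbed = embed_needle(altered, hay, length, rel_pos)
    return " ".join(base), " ".join(perturbed)


def design_grid():
    """Enumerate every combination of factor levels (full factorial design)."""
    return list(product(PERTURBATIONS, CONTEXTS, POSITIONS, LENGTHS))


if __name__ == "__main__":
    hay = ["Hay one.", "Hay two.", "Hay three.", "Hay four."]
    doc_a, doc_b = build_pair("The test passed.", "The test did not pass.", hay, 5, 0.5)
    print(doc_a)
    print(doc_b)
    print(len(design_grid()), "factor combinations")
```

Crossing every factor level with every other, rather than varying one factor at a time, is what allows interactions (for example, position effects that depend on context type) to be estimated from the resulting scores.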
