Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls

Ora Nova Fandina; Eitan Farchi; Shmulik Froimovich; Raviv Gal; Wesam Ibraheem; Rami Katan; Alice Podolsky

arXiv:2512.16272·cs.SE·January 21, 2026

TL;DR

This paper investigates the limitations of large language models as evaluators in code generation, introduces an analytic hint system to improve their accuracy, and demonstrates enhanced reliability through hybrid approaches in industrial settings.

Contribution

The paper presents a taxonomy of LLM-as-a-Judge (LaaJ) blind spots, develops an analytic checker tool, and shows that combining the checker's hints with LaaJs improves error detection and explanation quality.

Findings

01 LaaJs detect 45-63% of errors independently.

02 The combined LaaJ+Hints approach achieves up to 74% error coverage.

03 Hybrid evaluation methods produce richer, more accurate explanations.

Abstract

Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain-specific issues, raising concerns about their reliability in critical evaluation tasks. To better understand these limitations in practice, we examine LaaJ behavior in a concrete industrial use case: legacy code modernization via COBOL code generation. In this setting, we find that even production-deployed LaaJs can miss domain-critical errors, revealing consistent blind spots in their evaluation capabilities. To better understand these blind spots, we analyze generated COBOL programs and the associated LaaJ judgments, drawing on expert knowledge to construct a preliminary taxonomy. Based on this taxonomy, we develop a lightweight analytic checker tool that flags over 30 domain-specific issues observed in practice. We use its…
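The general idea of an analytic checker feeding hints to a judge can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tool: the two rules, the `Hint` type, and the functions `run_checker` and `build_judge_prompt` are all hypothetical, and the paper's actual taxonomy of over 30 issues is not reproduced here.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hint:
    """One finding from the analytic checker (hypothetical structure)."""
    rule_id: str
    message: str


# The two rules below are illustrative assumptions only; they are NOT
# taken from the paper's taxonomy.
def check_missing_procedure_division(source: str) -> Optional[Hint]:
    # Look for a PROCEDURE DIVISION header at the start of some line.
    pattern = r"^\s*PROCEDURE\s+DIVISION\b"
    if not re.search(pattern, source, re.IGNORECASE | re.MULTILINE):
        return Hint("missing-procedure-division",
                    "No PROCEDURE DIVISION header found.")
    return None


def check_missing_program_end(source: str) -> Optional[Hint]:
    # Look for an explicit program termination statement anywhere.
    if not re.search(r"\b(STOP\s+RUN|GOBACK)\b", source, re.IGNORECASE):
        return Hint("missing-program-end",
                    "No STOP RUN or GOBACK statement found.")
    return None


RULES: List[Callable[[str], Optional[Hint]]] = [
    check_missing_procedure_division,
    check_missing_program_end,
]


def run_checker(source: str) -> List[Hint]:
    """Apply every rule and collect the hints that fire, in rule order."""
    hints = []
    for rule in RULES:
        hint = rule(source)
        if hint is not None:
            hints.append(hint)
    return hints


def build_judge_prompt(source: str, hints: List[Hint]) -> str:
    """Assemble a judge prompt that carries the checker's hints."""
    lines = ["Evaluate the following generated COBOL program.",
             "", source.strip(), ""]
    if hints:
        lines.append("Analytic hints (verify each before relying on it):")
        lines.extend(f"- [{h.rule_id}] {h.message}" for h in hints)
    else:
        lines.append("Analytic hints: none.")
    return "\n".join(lines)
```

In this sketch the checker stays purely static and cheap, and the judge model receives its findings as text in the prompt; how the hints are actually delivered to the LaaJ in the paper is not stated in the portion of the abstract shown here.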

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques