Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
Ora Nova Fandina, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Rami Katan, Alice Podolsky

TL;DR
This paper investigates the limitations of large language models as evaluators in code generation, introduces an analytic hint system to improve their accuracy, and demonstrates enhanced reliability through hybrid approaches in industrial settings.
Contribution
It presents a taxonomy of blind spots in LLMs used as judges (LaaJs), develops an analytic checker tool, and shows that combining the checker's hints with LaaJs improves error detection and explanation quality.
Findings
LaaJs detect 45-63% of errors independently.
The combined LaaJ+Hints approach achieves up to 74% error coverage.
Hybrid evaluation methods produce richer, more accurate explanations.
Abstract
Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain-specific issues, raising concerns about their reliability in critical evaluation tasks. To better understand these limitations in practice, we examine LaaJ behavior in a concrete industrial use case: legacy code modernization via COBOL code generation. In this setting, we find that even production-deployed LaaJs can miss domain-critical errors, revealing consistent blind spots in their evaluation capabilities. To better understand these blind spots, we analyze generated COBOL programs and the associated LaaJ judgments, drawing on expert knowledge to construct a preliminary taxonomy. Based on this taxonomy, we develop a lightweight analytic checker tool that flags over 30 domain-specific issues observed in practice. We use its…
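The pairing of an analytic checker with a LaaJ described in the abstract can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the two rules (`text-past-column-72`, `no-explicit-termination`), the `Hint` record, and the prompt wording are hypothetical and are not taken from the paper's taxonomy or tool. The sketch only shows the general shape: run cheap deterministic checks over the generated COBOL, then pass their findings to the judge as hints.

```python
import re
from dataclasses import dataclass


@dataclass
class Hint:
    rule: str      # hypothetical rule identifier, not from the paper
    line: int      # 1-based source line, 0 when the issue is program-wide
    message: str


def check_cobol(source: str) -> list[Hint]:
    """Run two illustrative (hypothetical) checks over fixed-format COBOL text."""
    hints: list[Hint] = []
    for number, text in enumerate(source.splitlines(), start=1):
        # In fixed-format COBOL, columns 73-80 are ignored by the compiler,
        # so any non-blank text there never becomes part of the program.
        if len(text) > 72 and text[72:].strip():
            hints.append(Hint("text-past-column-72", number,
                              "Non-blank text after column 72 is not compiled."))
    # Flag programs that contain no explicit termination statement.
    if not re.search(r"\b(STOP\s+RUN|GOBACK)\b", source, flags=re.IGNORECASE):
        hints.append(Hint("no-explicit-termination", 0,
                          "No STOP RUN or GOBACK statement found."))
    return hints


def build_judge_prompt(source: str, hints: list[Hint]) -> str:
    """Assemble a judge prompt that lists the checker's hints before the code."""
    if hints:
        hint_block = "\n".join(
            f"- [{h.rule}] line {h.line}: {h.message}" for h in hints
        )
    else:
        hint_block = "- (no analytic hints)"
    return (
        "You are reviewing generated COBOL code.\n"
        "Analytic hints from a static checker (verify each before relying on it):\n"
        f"{hint_block}\n\n"
        f"Code:\n{source}"
    )
```

A caller would run `check_cobol` on each generated program and send the string from `build_judge_prompt` to whichever judge model is in use. Keeping the checker deterministic and separate from the judge means each hint can be traced back to a specific rule.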
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
