Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
Mohamed Hesham Elganayni, Runsheng Chen, Sebastian Nagl, Matthias Grabmair

TL;DR
This paper investigates how prompt optimization and judge selection affect LLM-as-a-Judge evaluation of free-text legal QA. Automatic prompt optimization outperforms human-designed prompts, and prompts optimized with lenient judge feedback show higher, more consistent gains and transfer better to strict judges than the reverse.
Contribution
The paper systematically optimizes task prompts with judge feedback, using the ProTeGi method, and shows how judge disposition influences prompt effectiveness and transferability across judges.
Findings
Automatic prompt optimization consistently outperforms the human-designed baseline.
Lenient judge feedback yields higher, more consistent gains than strict judge feedback.
Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction.
Abstract
This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction. Analysis reveals that lenient judges provide permissive…
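The cross-judge transfer test described in the abstract can be pictured as a small table of score gains: for each optimized prompt and each judge, the mean judge score minus the baseline prompt's mean score under that same judge. The sketch below is illustrative only and is not the authors' code. The two judge functions are toy keyword-matching stand-ins for the LLM judges, the names `lenient_judge`, `strict_judge`, and `transfer_table` are hypothetical, and the example answers are invented placeholders, not LEXam data or results from the paper.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# A judge maps (answer, reference) to a score in [0, 1].
Judge = Callable[[str, str], float]


def lenient_judge(answer: str, reference: str) -> float:
    """Toy stand-in for a lenient judge: partial credit per matched keyword."""
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ref & ans) / len(ref)


def strict_judge(answer: str, reference: str) -> float:
    """Toy stand-in for a strict judge: credit only if every keyword appears."""
    return 1.0 if lenient_judge(answer, reference) == 1.0 else 0.0


def mean_score(answers: List[str], references: List[str], judge: Judge) -> float:
    """Average judge score over paired answers and references."""
    return mean(judge(a, r) for a, r in zip(answers, references))


def transfer_table(
    answers_by_prompt: Dict[str, List[str]],
    references: List[str],
    judges: Dict[str, Judge],
) -> Dict[Tuple[str, str], float]:
    """Gain of each non-baseline prompt over the baseline, per evaluating judge."""
    table = {}
    for judge_name, judge in judges.items():
        base = mean_score(answers_by_prompt["baseline"], references, judge)
        for prompt_name, answers in answers_by_prompt.items():
            if prompt_name == "baseline":
                continue
            score = mean_score(answers, references, judge)
            table[(prompt_name, judge_name)] = score - base
    return table


# Invented placeholder data (assumption): answers a task model might give
# under a baseline prompt and under a prompt optimized with lenient feedback.
references = ["contract void mistake", "strict liability applies"]
answers_by_prompt = {
    "baseline": ["contract void", "liability"],
    "optimized_with_lenient": ["contract void mistake", "strict liability"],
}
judges = {"lenient": lenient_judge, "strict": strict_judge}

table = transfer_table(answers_by_prompt, references, judges)
for (prompt_name, judge_name), gain in sorted(table.items()):
    print(f"{prompt_name} evaluated by {judge_name}: gain {gain:+.3f}")
```

In this layout, the entry where the optimizing judge and the evaluating judge match is the in-judge gain, and the mismatched entry is the transfer gain. Comparing the two directions (lenient to strict versus strict to lenient) is the comparison the abstract reports.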
