How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows
Songhee Han, Jueun Shin, Jiyoon Han, Bung-Woo Jun, and Hilal Ayan Karabatman

TL;DR
This study tests how well LLM-as-judge ratings assess interpretive responses generated by five models. The ratings align broadly with human judgments but fall short on score accuracy and nuance.
Contribution
It provides empirical evidence on the reliability of LLM-based evaluations for interpretive quality, guiding model selection in qualitative research workflows.
Findings
LLM-as-judge scores reflect broad trends but differ in magnitude from human ratings.
Among the automated metrics, coherence aligns best with human evaluations.
Automated metrics often misjudge nuanced or non-literal interpretive responses.
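The distinction between agreeing on trends and agreeing on magnitude can be illustrated with a minimal sketch. This is not the authors' analysis: the scores below are hypothetical toy values, and the function names (`average_ranks`, `spearman`, `mean_abs_gap`) are made up for illustration. Rank correlation captures whether two raters order the models the same way, while the mean absolute gap captures how far apart the scores themselves are.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_abs_gap(x, y):
    """Average absolute difference between paired scores."""
    return mean(abs(a - b) for a, b in zip(x, y))


# Hypothetical per-model mean scores (NOT data from the paper).
human = [3.1, 3.8, 4.2, 2.9, 3.5]
judge = [4.0, 4.5, 4.8, 3.9, 4.3]

rho = spearman(human, judge)      # same ordering of models
gap = mean_abs_gap(human, judge)  # but consistently higher scores
print(f"rank agreement: {rho:.2f}, mean absolute gap: {gap:.2f}")
```

In this toy case the judge ranks the models exactly as the human raters do while scoring each one noticeably higher, which is the kind of pattern the first finding describes: useful for comparing models, less reliable as an absolute score.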
Abstract
As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive quality or comparison across models. This practice leaves model selection largely unexamined despite its potential influence on interpretive outcomes. To address this gap, this study examines whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality and can inform model-level decision making. Using 712 conversational excerpts from semi-structured interviews with K-12 mathematics teachers, we generated one-sentence interpretive responses using five widely adopted inference models: Command R+ (Cohere), Gemini 2.5 Pro (Google), GPT-5.1 (OpenAI), Llama 4 Scout-17B Instruct (Meta), and Qwen 3-32B Dense (Alibaba).…
