SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
Homaira Huda Shomee, Rochana Chaturvedi, Yangxinyu Xie, Tanwi Mallick

TL;DR
This paper introduces SCORE, a multi-dimensional, reference-free evaluation framework for LLMs that assesses specificity, robustness, relevance, and context use, addressing gaps in current evaluation methods for high-stakes, domain-specific tasks.
Contribution
The paper proposes a novel multi-dimensional evaluation framework and a curated dataset for systematic, domain-sensitive assessment of LLM outputs in high-stakes settings.
Findings
No single metric fully captures answer quality.
Structured, multi-metric evaluation is essential for high-stakes deployment.
Human evaluation reveals subjectivity in domain-specific assessments.
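A minimal sketch of why the first two findings favor reporting dimensions separately. This is an illustration only, not the paper's scoring procedure: the `DimensionScores` class, the [0, 1] score range, the 0.5 threshold, and the example values are all assumptions made for this example; only the four dimension names come from the abstract.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical per-answer scores in [0, 1], one per dimension."""
    specificity: float
    robustness: float
    relevance: float
    context_utilization: float

    def mean(self) -> float:
        # Collapse all dimensions into a single number (loses information).
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

    def weak_dimensions(self, threshold: float = 0.5) -> List[str]:
        # Names of dimensions scoring below the (assumed) threshold.
        return [f.name for f in fields(self) if getattr(self, f.name) < threshold]


# Made-up values: a relevant, well-grounded answer that lacks specifics.
answer = DimensionScores(
    specificity=0.2,
    robustness=0.9,
    relevance=0.95,
    context_utilization=0.9,
)

print(answer.mean())             # the average alone looks acceptable
print(answer.weak_dimensions())  # the per-dimension view exposes the gap
```

In this made-up case the averaged score stays well above the threshold while the specificity score does not, which is the kind of failure a single aggregate metric would hide.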
Abstract
Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of…
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Multimodal Machine Learning Applications
