SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

Homaira Huda Shomee; Rochana Chaturvedi; Yangxinyu Xie; Tanwi Mallick

arXiv:2602.10017·cs.CL·February 11, 2026

Open Access

TL;DR

This paper introduces SCORE, a multi-dimensional, reference-free evaluation framework for LLMs that assesses specificity, robustness, relevance, and context use, addressing gaps in current evaluation methods for high-stakes, domain-specific tasks.

Contribution

The paper proposes a novel multi-dimensional evaluation framework and a curated dataset for systematic, domain-sensitive assessment of LLM outputs in high-stakes settings.

Findings

01. No single metric fully captures answer quality.

02. Structured, multi-metric evaluation is essential for high-stakes deployment.

03. Human evaluation reveals subjectivity in domain-specific assessments.

Abstract

Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of…

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Multimodal Machine Learning Applications