Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering

Sai Shridhar Balamurali; Lu Cheng

arXiv:2511.07659·cs.CL·November 12, 2025

Open Access

TL;DR

This paper shows that a lightweight NLI-based metric, augmented with a simple lexical-match flag, matches GPT-4o's accuracy (89.9%) when evaluating LLM answers in long-form question answering while using orders of magnitude fewer parameters. It also introduces DIVER-QA, a 3000-sample human-annotated benchmark for testing how well such metrics align with human judgment.

Contribution

The study re-evaluates off-the-shelf NLI scoring as a metric for LLM answers, shows that it remains competitive with the far more expensive "LLM-as-Judge" approach, and introduces DIVER-QA, a human-annotated benchmark spanning five QA datasets and five candidate LLMs.

Findings

01

NLI-based scoring matches GPT-4o's accuracy (89.9%) on long-form QA

02

Inexpensive NLI-based metrics are competitive with costly "LLM-as-Judge" evaluation

03

DIVER-QA, a 3000-sample human-annotated benchmark, enables rigorous testing of human alignment

Abstract

Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas "LLM-as-Judge" scoring is computationally expensive. We re-evaluate a lightweight alternative, off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag, and find that this decades-old technique matches GPT-4o's accuracy (89.9%) on long-form QA while requiring orders of magnitude fewer parameters. To test the human alignment of these metrics rigorously, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive, and we offer DIVER-QA as an open resource for future metric research.
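
The abstract does not spell out how the NLI score and the lexical-match flag are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the flag is a whole-word containment check on normalized text, that the flag and the NLI verdict are combined with a logical OR, and that the premise and hypothesis are built by prefixing the question. The names `normalize`, `lexical_match`, `judge_answer`, `entail_prob`, and `threshold` are all hypothetical.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def lexical_match(reference: str, candidate: str) -> bool:
    """Flag: the normalized reference occurs as a whole-word span in the candidate."""
    ref, cand = normalize(reference), normalize(candidate)
    # Padding with spaces prevents "paris" from matching inside "parisian".
    return bool(ref) and f" {ref} " in f" {cand} "


def judge_answer(
    question: str,
    reference: str,
    candidate: str,
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> bool:
    """Accept the candidate if the lexical flag fires or NLI predicts entailment.

    `entail_prob(premise, hypothesis)` is a caller-supplied function returning
    the entailment probability from any off-the-shelf NLI model. The OR
    combination and the 0.5 threshold are assumptions made for illustration.
    """
    if lexical_match(reference, candidate):
        return True
    premise = f"{question} {candidate}"      # assumed input formatting
    hypothesis = f"{question} {reference}"   # assumed input formatting
    return entail_prob(premise, hypothesis) >= threshold


if __name__ == "__main__":
    # Stand-in scorer so the sketch runs without a model; replace it with a
    # real NLI classifier's entailment probability in practice.
    def dummy_scorer(premise: str, hypothesis: str) -> float:
        return 0.0

    print(judge_answer(
        "What is the capital of France?",
        "Paris",
        "The capital of France is Paris.",
        dummy_scorer,
    ))
```

Passing the scorer in as a function keeps the sketch independent of any particular NLI checkpoint; the lexical flag short-circuits the model call for answers that already contain the reference string.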

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Expert finding and Q&A systems