Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering
Sai Shridhar Balamurali, Lu Cheng

TL;DR
This paper shows that a lightweight NLI-based metric, augmented with a simple lexical-match flag, matches GPT-4o's accuracy when evaluating large language models on long-form question answering while using far fewer parameters. It also introduces DIVER-QA, a new human-annotated benchmark for testing how well such metrics align with human judgment.
Contribution
The study re-evaluates NLI-based metrics for LLM evaluation, shows that they are competitive with expensive LLM-as-Judge methods, and introduces DIVER-QA, a new benchmark for human-aligned evaluation.
Findings
NLI-based scoring matches GPT-4o's accuracy (89.9%) on long-form QA
Inexpensive metrics are competitive with costly evaluation methods
DIVER-QA benchmark enables rigorous human-aligned evaluation
Abstract
Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas "LLM-as-Judge" scoring is computationally expensive. We re-evaluate a lightweight alternative -- off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag -- and find that this decades-old technique matches GPT-4o's accuracy (89.9%) on long-form QA, while requiring orders of magnitude fewer parameters. To test human alignment of these metrics rigorously, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive and offer DIVER-QA as an open resource for future metric research.
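The abstract describes the metric only at a high level, so the following is a minimal sketch of one way an NLI score and a lexical-match flag could be combined, not the authors' implementation. The OR-style combination, the 0.5 threshold, the premise/hypothesis templates, and the names normalize, lexical_match, and judge are all assumptions made for illustration. The NLI model is passed in as a hypothetical callable, entail_prob(premise, hypothesis), which in practice would wrap an off-the-shelf NLI classifier.

```python
import re
import string
from typing import Callable

# Hypothetical signature: (premise, hypothesis) -> entailment probability in [0, 1].
EntailFn = Callable[[str, str], float]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def lexical_match(gold: str, candidate: str) -> bool:
    """True if the normalized gold answer occurs as whole words in the candidate."""
    g = normalize(gold)
    if not g:
        return False
    # Padding with spaces avoids partial-word hits such as "paris" in "comparison".
    return f" {g} " in f" {normalize(candidate)} "


def judge(question: str, gold: str, candidate: str,
          entail_prob: EntailFn, threshold: float = 0.5) -> bool:
    """Accept the candidate if the lexical flag fires or NLI entailment is high.

    Assumption: the two signals are OR-ed; the paper may combine them differently.
    """
    if lexical_match(gold, candidate):
        return True
    # Assumed templates: the candidate answer should entail the reference answer.
    premise = f"Question: {question} Answer: {candidate}"
    hypothesis = f"Question: {question} Answer: {gold}"
    return entail_prob(premise, hypothesis) >= threshold
```

With a stub in place of a real model, judge("What is the capital of France?", "Paris", "It is Paris.", lambda p, h: 0.0) is accepted through the lexical flag alone, while a paraphrased answer would depend on the entailment score. Keeping the NLI model behind a callable makes it straightforward to swap models when comparing against human labels.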
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Expert finding and Q&A systems
