A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis
Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami

TL;DR
This paper introduces a novel, scalable benchmark for evaluating large language models' open-ended text generation using n-gram statistics and rules, avoiding human or LLM-based judgments.
Contribution
It presents a new benchmark with three metrics (Fluency, Truthfulness, and Helpfulness) that correlates strongly with GPT-4o-based evaluations while requiring significantly fewer computational resources.
Findings
Strong correlation with GPT-4o-based evaluations
Requires significantly fewer computational resources
Effective for scalable assessment of LLMs' open-ended generation
Abstract
Evaluating the open-ended text generation of large language models (LLMs) is challenging because of the lack of a clear ground truth and the high cost of human or LLM-based assessments. We propose a novel benchmark that evaluates LLMs using n-gram statistics and rules, without relying on human judgement or LLM-as-a-judge approaches. Using 50 question and reference answer sets, we introduce three new metrics based on n-grams and rules: Fluency, Truthfulness, and Helpfulness. Our benchmark strongly correlates with GPT-4o-based evaluations while requiring significantly fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs' open-ended generation capabilities.
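To make the idea of judge-free, n-gram-based scoring concrete, here is a minimal sketch of an n-gram overlap score between a generated answer and a set of reference answers. It is not the paper's definition of Fluency, Truthfulness, or Helpfulness, which the abstract does not spell out. The function names `ngrams` and `ngram_overlap`, the whitespace tokenization, and the clipped-count matching are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(candidate, references, n=2):
    """Fraction of candidate n-grams that also occur in the references.

    Illustrative only: tokenization is lowercase whitespace splitting, and
    each candidate n-gram count is clipped to the maximum count observed in
    any single reference. This is an assumed scoring rule, not the paper's.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0

    # For every n-gram, keep the largest count seen in any one reference.
    ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.lower().split(), n)).items():
            ref_counts[gram] = max(ref_counts[gram], count)

    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    print(ngram_overlap("the cat sat on a rug", refs))
```

A score like this needs only token counting, so it can run without a judge model or human raters, which is the kind of cost advantage the abstract describes.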
Taxonomy
Topics: Digital Rights Management and Security
