PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models
Gerald Friedland, Xin Huang, Yueying Cui, Vishaal Kapoor, Ashish Khetan, Sanjiv Das

TL;DR
PPLqa is an unsupervised, language-independent information-theoretic metric for evaluating the quality of responses from large language models, enabling model ranking without ground truth annotations.
Contribution
The paper introduces PPLqa, an unsupervised metric that assesses LLM response quality, correlates with human judgments, and works well on long-form Q&A tasks.
Findings
PPLqa performs comparably to existing metrics.
It works better with long-form responses.
It bypasses the need for ground truth annotations.
Abstract
We propose PPLqa, an easy-to-compute, language-independent, information-theoretic metric to measure the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enable users to rank generative language models by response quality, so as to select the best model for a given task. Our single metric assesses LLMs with an approach that subsumes, but is not explicitly based on, coherence and fluency (quality of writing) and relevance and consistency (appropriateness of response to the query). PPLqa performs as well as other related metrics, and works better with long-form Q&A. Thus, PPLqa enables bypassing the lengthy annotation process required for ground truth evaluations, and it also correlates well with human and LLM rankings.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies
