Exploring Precision and Recall to assess the quality and diversity of LLMs
Florian Le Bronnec, Alexandre Verine, Benjamin Negrevergne, Yann Chevaleyre, Alexandre Allauzen

TL;DR
This paper proposes a new evaluation framework for Large Language Models that applies Precision and Recall metrics from image generation to text, providing a nuanced assessment of quality and diversity without requiring aligned datasets.
Contribution
It introduces a novel evaluation approach for LLMs using Precision and Recall, revealing insights into their performance on open-ended tasks and highlighting trade-offs between quality and diversity.
Findings
Trade-off observed between quality and diversity in generated texts
Fine-tuning on instruction datasets affects model performance
New evaluation toolkit extends NLP assessment methods
Abstract
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral, focusing on importing Precision and Recall metrics from image generation to text generation. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. By conducting a comprehensive evaluation of state-of-the-art language models, the study reveals new insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned on instruction datasets or with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges that current…
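To make the idea of distribution-based Precision and Recall concrete, here is a minimal sketch of one estimator common in the image-generation literature: a k-nearest-neighbour support estimate computed on embedding vectors. This summary does not state which estimator or embedding model the paper uses, so the function names (`knn_radii`, `coverage`, `precision_recall`), the choice of `k`, and the use of random vectors in place of text embeddings are all assumptions made for illustration.

```python
import numpy as np


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # After sorting, column 0 is the point itself (distance 0),
    # so column k holds the k-th nearest neighbour distance.
    return np.sort(dists, axis=1)[:, k]


def coverage(queries, support, k):
    """Fraction of queries lying inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)
    dists = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean(np.any(dists <= radii[None, :], axis=1)))


def precision_recall(real, generated, k=3):
    """Hypothetical helper: precision = quality proxy, recall = diversity proxy."""
    precision = coverage(generated, real, k)  # generated samples near real data
    recall = coverage(real, generated, k)     # real data covered by generated samples
    return precision, recall


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for embeddings of reference texts (assumption: any fixed encoder).
    real = rng.normal(size=(200, 8))
    # A "collapsed" generator: many near-copies of a single reference sample.
    collapsed = real[:1] + 0.01 * rng.normal(size=(200, 8))
    p, r = precision_recall(real, collapsed)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy setting the collapsed generator stays close to real data but covers almost none of it, which is the high-quality, low-diversity pattern the abstract describes as a trade-off. Neither set needs to be aligned sample-by-sample with the other, only drawn from the two distributions being compared.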
