Exploring Precision and Recall to assess the quality and diversity of LLMs

Florian Le Bronnec; Alexandre Verine; Benjamin Negrevergne; Yann Chevaleyre; Alexandre Allauzen

arXiv:2402.10693·cs.CL·June 5, 2024·3 cites

TL;DR

This paper proposes a new evaluation framework for Large Language Models that applies Precision and Recall metrics from image generation to text, providing a nuanced assessment of quality and diversity without requiring aligned datasets.

Contribution

It introduces a novel evaluation approach for LLMs using Precision and Recall, revealing insights into their performance on open-ended tasks and highlighting trade-offs between quality and diversity.

Findings

1. A trade-off is observed between the quality and the diversity of generated texts.

2. Fine-tuning on instruction datasets or with human feedback makes this trade-off more pronounced.

3. The new evaluation framework extends the toolkit for distribution-based NLP evaluation.

Abstract

We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral, focusing on importing Precision and Recall metrics from image generation to text generation. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. By conducting a comprehensive evaluation of state-of-the-art language models, the study reveals new insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned on instruction datasets or with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges that current…
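
The abstract does not say which estimator is used, so the following is only an illustrative sketch, not the paper's implementation. It assumes the k-nearest-neighbour formulation of Precision and Recall used for image generators, applied here to text embeddings. Precision is the share of generated samples that fall inside the estimated support of the reference samples (quality), and Recall is the share of reference samples that fall inside the estimated support of the generated samples (diversity). The function names, the choice of k, and the random arrays standing in for embeddings are all hypothetical.

```python
import numpy as np


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # After sorting a row, column 0 is the self-distance (0), so column k
    # holds the distance to the k-th nearest other point.
    return np.sort(d, axis=1)[:, k]


def coverage(queries, support, k):
    """Fraction of queries lying inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    inside = (d <= radii[None, :]).any(axis=1)
    return float(inside.mean())


def precision_recall(real, generated, k=3):
    """k-NN estimate of (precision, recall) between two sets of embeddings."""
    precision = coverage(generated, real, k)  # quality of generated samples
    recall = coverage(real, generated, k)     # diversity of generated samples
    return precision, recall


if __name__ == "__main__":
    # Placeholder data: random vectors stand in for text embeddings.
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 16))
    generated = rng.normal(size=(200, 16))
    p, r = precision_recall(real, generated, k=3)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Under this sketch, a model that only reproduces one mode of the reference distribution would keep a high Precision while its Recall drops, which is the kind of quality versus diversity trade-off the abstract describes. No aligned corpus is needed, since the two sets are compared only as distributions.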

Code & Models

Repositories

alexverine/pr-4-llm (official)

Videos

Exploring Precision and Recall to assess the quality and diversity of LLMs (Underline)

Taxonomy

Topics: Wikis in Education and Collaboration