Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)

Krishnaram Kenthapadi; Mehrnoosh Sameki; Ankur Taly

arXiv:2407.12858 · cs.CL · July 19, 2024

TL;DR

This survey reviews the challenges and methods for evaluating and ensuring the trustworthiness of large language models, emphasizing issues like hallucinations, bias, and safety in high-stakes AI applications.

Contribution

It provides a comprehensive overview of current evaluation techniques, identifies open challenges, and offers lessons learned for improving the reliability of generative AI systems.

Findings

01. Survey of state-of-the-art evaluation approaches

02. Identification of key challenges in LLM safety and trustworthiness

03. Discussion of open problems and future directions

Abstract

With the ongoing rapid adoption of Artificial Intelligence (AI)-based systems in high-stakes domains, ensuring the trustworthiness, safety, and observability of these systems has become crucial. It is essential to evaluate and monitor AI systems not only for accuracy and quality-related metrics but also for robustness, bias, security, interpretability, and other responsible AI dimensions. We focus on large language models (LLMs) and other generative AI models, which present additional challenges such as hallucinations, harmful and manipulative content, and copyright infringement. In this survey article accompanying our KDD 2024 tutorial, we highlight a wide range of harms associated with generative AI systems, and survey state-of-the-art approaches (along with open challenges) to address these harms.
