Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
Krishnaram Kenthapadi, Mehrnoosh Sameki, Ankur Taly

TL;DR
This survey reviews the challenges and methods for evaluating and ensuring the trustworthiness of large language models, emphasizing issues such as hallucinations, bias, and safety in high-stakes AI applications.
Contribution
It provides a comprehensive overview of current evaluation techniques, identifies open challenges, and offers lessons learned for improving the reliability of generative AI systems.
Findings
Survey of state-of-the-art evaluation approaches
Identification of key challenges in LLM safety and trustworthiness
Discussion of open problems and future directions
Abstract
With the ongoing rapid adoption of Artificial Intelligence (AI)-based systems in high-stakes domains, ensuring the trustworthiness, safety, and observability of these systems has become crucial. It is essential to evaluate and monitor AI systems not only for accuracy and quality-related metrics but also for robustness, bias, security, interpretability, and other responsible AI dimensions. We focus on large language models (LLMs) and other generative AI models, which present additional challenges such as hallucinations, harmful and manipulative content, and copyright infringement. In this survey article accompanying our KDD 2024 tutorial, we highlight a wide range of harms associated with generative AI systems, and survey state-of-the-art approaches (along with open challenges) to address these harms.
