NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
Robert J. Moore, Sungeun An, Farhan Ahmed, Jay Pankaj Gala

TL;DR
NC-Bench is a new benchmark for evaluating large language models' conversational skills based on natural conversation structure, covering basic, retrieval-augmented, and complex interaction patterns.
Contribution
NC-Bench introduces a theory-grounded, extensible framework for assessing LLMs' conversational competence, going beyond traditional content-focused benchmarks.
Findings
Models excel at answering but struggle with repair tasks.
Performance varies across interaction types and complexity.
Complex multi-turn requests remain challenging for models.
Abstract
The Natural Conversation Benchmark (NC-Bench) introduces a new approach to evaluating the general conversational competence of large language models (LLMs). Unlike prior benchmarks that focus on the content of model behavior, NC-Bench focuses on the form and structure of natural conversation. Grounded in the IBM Natural Conversation Framework (NCF), NC-Bench comprises three distinct sets: (1) the basic set evaluates fundamental sequence management practices, such as answering inquiries, repairing responses, and closing conversational pairs; (2) the retrieval-augmented generation (RAG) set applies the same sequence management patterns as the first set but incorporates information-seeking via RAG; (3) the complex request set extends to requests involving more intricate sequence management patterns. Each set tests a model's ability to produce contextually appropriate conversational actions…
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning
