End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

Nhi Dang; Tung Le; Huy Tien Nguyen

arXiv:2603.10570·cs.CL·March 12, 2026

Open Access

TL;DR

This paper presents an automatic, scalable evaluation framework for chatbots that uses LLMs to judge responses and confidence-based filtering to flag uncertain cases. It reduces human review effort and is applicable across languages and domains.

Contribution

The authors introduce a modular, language-agnostic evaluation system that automates chatbot assessment using LLMs and confidence filtering, improving scalability and reducing manual review.

Findings

01

High agreement with human judgments on a Vietnamese news dataset

02

Significantly reduces review overhead

03

Modular and adaptable to various domains

Abstract

Large language models (LLMs) combined with retrieval-augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily…
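The confidence-based filtering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Judgment` record, the `split_by_confidence` and `review_fraction` helpers, the assumption that the judge reports a confidence score in [0, 1], and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judged chatbot answer (hypothetical record layout)."""
    question: str
    verdict: str       # e.g. "correct" or "incorrect", as returned by an LLM judge
    confidence: float  # assumed: judge confidence in [0, 1]


def split_by_confidence(judgments, threshold=0.8):
    """Split judgments into (auto_accepted, needs_review).

    Judgments at or above the threshold are accepted automatically;
    the rest are flagged as uncertain for human review.
    The default threshold is an illustrative value, not one from the paper.
    """
    auto_accepted = [j for j in judgments if j.confidence >= threshold]
    needs_review = [j for j in judgments if j.confidence < threshold]
    return auto_accepted, needs_review


def review_fraction(judgments, threshold=0.8):
    """Fraction of judgments that would still require human review."""
    if not judgments:
        return 0.0
    _, needs_review = split_by_confidence(judgments, threshold)
    return len(needs_review) / len(judgments)
```

Under these assumptions, raising the threshold sends more cases to human reviewers and lowering it reduces review overhead, so the threshold is the knob that trades review effort against the risk of accepting a wrong automatic verdict.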

Taxonomy

Topics: AI in Service Interactions · Topic Modeling · Artificial Intelligence in Healthcare and Education