BotEval: Facilitating Interactive Human Evaluation
Hyundong Cho, Thamme Gowda, Yuyang Huang, Zixun Lu, Tianli Tong, Jonathan May

TL;DR
BotEval is an open-source toolkit designed to facilitate interactive human evaluation of NLP models, enabling direct human-bot interactions to better assess performance on complex tasks like conversation moderation.
Contribution
The paper introduces a customizable, user-friendly evaluation toolkit that supports human-bot interaction and integrates with crowdsourcing platforms, in contrast to evaluation setups where humans only judge static inputs.
Findings
BotEval is demonstrated through a study that evaluates chatbot performance in conversation moderation.
The toolkit provides templates for common interactive evaluation use cases of varying complexity.
By having evaluators interact with models directly rather than judge static inputs, BotEval aims to make human evaluation of interactive tasks more realistic.
Abstract
Following the rapid progress in natural language processing (NLP) models, language models are applied to increasingly complex interactive tasks such as negotiation and conversation moderation. Having human evaluators directly interact with these NLP models is essential for adequately evaluating their performance on such interactive tasks. We develop BotEval, an easily customizable, open-source evaluation toolkit that focuses on enabling human-bot interactions as part of the evaluation process, as opposed to human evaluators making judgements on a static input. BotEval balances flexibility for customization and user-friendliness by providing templates for common use cases that span various degrees of complexity and built-in compatibility with popular crowdsourcing platforms. We showcase the numerous useful features of BotEval through a study that evaluates the performance of…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Evacuation and Crowd Dynamics
