Assessing Task-based Chatbots: Snapshot and Curated Datasets for Dialogflow
Elena Masserini, Diego Clerissi, Daniela Micucci, Leonardo Mariani

TL;DR
This paper introduces TOFU-D and COD, two datasets of Dialogflow chatbots collected from GitHub, to support empirical research on chatbot quality and security. A preliminary assessment reveals gaps in test coverage and frequent security vulnerabilities in several chatbots.
Contribution
It provides the first large-scale curated datasets of Dialogflow chatbots and demonstrates their use in assessing quality and security issues.
Findings
Identified gaps in test coverage
Detected frequent security vulnerabilities
Highlighted the need for systematic multi-platform research
Abstract
In recent years, chatbots have gained widespread adoption thanks to their ability to assist users at any time and across diverse domains. However, the lack of large-scale curated datasets limits research on their quality and reliability. This paper presents TOFU-D, a snapshot of 1,788 Dialogflow chatbots from GitHub, and COD, a curated subset of TOFU-D including 185 validated chatbots. The two datasets capture a wide range of domains, languages, and implementation patterns, offering a sound basis for empirical studies on chatbot quality and security. A preliminary assessment using the Botium testing framework and the Bandit static analyzer revealed gaps in test coverage and frequent security vulnerabilities in several chatbots, highlighting the need for systematic, multi-platform research on chatbot quality and security.
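As a rough illustration of the kind of security assessment mentioned in the abstract, the sketch below tallies findings by severity from a Bandit JSON report (such a report can be produced with `bandit -r <dir> -f json -o report.json`). It is not the authors' pipeline: the function name `severity_counts`, the file paths, and the sample findings are hypothetical and are not data from the paper.

```python
from collections import Counter


def severity_counts(report: dict) -> Counter:
    """Count findings per severity level in a parsed Bandit JSON report."""
    # Bandit's JSON output lists findings under "results"; each finding
    # carries an "issue_severity" of LOW, MEDIUM, or HIGH.
    return Counter(item["issue_severity"] for item in report.get("results", []))


# Hypothetical report fragment for illustration only (not data from the paper).
sample_report = {
    "results": [
        {"filename": "bot/webhook.py", "test_id": "B105", "issue_severity": "LOW"},
        {"filename": "bot/webhook.py", "test_id": "B201", "issue_severity": "HIGH"},
        {"filename": "bot/db.py", "test_id": "B608", "issue_severity": "MEDIUM"},
        {"filename": "bot/util.py", "test_id": "B101", "issue_severity": "LOW"},
    ]
}

counts = severity_counts(sample_report)
for level in ("HIGH", "MEDIUM", "LOW"):
    print(f"{level}: {counts[level]}")
```

For a real report, load the file with `json.load` and pass the resulting dictionary to the same function; running it per chatbot would give a simple per-project vulnerability profile.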
Taxonomy
Topics: AI in Service Interactions · Spreadsheets and End-User Computing · Artificial Intelligence in Healthcare and Education
