Benchmarking Natural Language Understanding Services for building Conversational Agents
Xingkun Liu, Arash Eshghi, Pawel Swietojanski, Verena Rieser

TL;DR
This paper evaluates four popular NLU services (Watson, Dialogflow, LUIS, and Rasa) on a 21-domain dataset of 25K user utterances, comparing their strengths and weaknesses in intent classification and entity type recognition.
Contribution
It presents the first wide-coverage, multi-domain comparison of popular NLU services, together with a 25K-utterance dataset annotated with Intent and Entity Type labels that the authors plan to release with the paper.
Findings
Watson significantly outperforms Dialogflow, LUIS, and Rasa on intent classification
Watson performs significantly worse on entity type recognition because of its low precision
Dialogflow, LUIS, and Rasa perform well across tasks
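
The low-precision finding is easier to read with the metric definitions in hand. The sketch below computes micro-averaged precision, recall, and F1 over (utterance, entity type) pairs using the standard definitions; it is not the paper's scoring script, and the function name `precision_recall_f1` and the toy data are hypothetical. It illustrates how a system that predicts too many entity types can keep perfect recall while its precision drops.

```python
from collections import Counter


def precision_recall_f1(gold, predicted):
    """Micro-averaged precision, recall and F1 over (utterance_id, entity_type) pairs."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # True positives: predicted pairs that also appear in the gold annotation.
    tp = sum((gold_counts & pred_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data (not taken from the paper's dataset).
gold = [(0, "date"), (1, "place"), (2, "person")]
# An over-predicting system: every gold entity is found, plus three spurious ones.
over_predicting = [
    (0, "date"), (0, "time"),
    (1, "place"), (1, "person"),
    (2, "person"), (2, "place"),
]

p, r, f = precision_recall_f1(gold, over_predicting)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case all three gold entities are recovered, so recall is 1.0, but only three of the six predictions are correct, so precision is 0.5 and F1 falls to about 0.67.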
Abstract
We have recently seen the emergence of several publicly available Natural Language Understanding (NLU) toolkits, which map user utterances to structured, but more abstract, Dialogue Act (DA) or Intent specifications, while making this process accessible to the lay developer. In this paper, we present the first wide coverage evaluation and comparison of some of the most popular NLU services, on a large, multi-domain (21 domains) dataset of 25K user utterances that we have collected and annotated with Intent and Entity Type specifications and which will be released as part of this submission. The results show that on Intent classification Watson significantly outperforms the other platforms, namely, Dialogflow, LUIS and Rasa; though these also perform well. Interestingly, on Entity Type recognition, Watson performs significantly worse due to its low Precision. Again, Dialogflow, LUIS and…
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
