# Evaluating Variable-Length Multiple-Option Lists in Chatbots and Mobile Search

**Authors:** Pepa Atanasova, Georgi Karadzhov, Yasen Kiprov, Preslav Nakov, Fabrizio Sebastiani

arXiv: 1905.10565 · 2021-09-22

## TL;DR

This paper investigates how to generate and evaluate variable-length answer lists in chatbots and mobile search, aiming to balance informativeness and user experience, and introduces new evaluation metrics tailored for this task.

## Contribution

The paper defines properties for evaluating variable-length answer lists, analyzes the limitations of existing measures, and proposes novel evaluation measures suited to chatbot and mobile search applications.

## Key findings

- Existing IR evaluation measures are inadequate for variable-length lists in chatbots.
- The proposed evaluation measures better capture the quality of variable-length answer lists.
- The paper gives guidelines for choosing answer list lengths that improve user satisfaction.
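
As a small illustration of the first finding (not taken from the paper), consider reciprocal rank, a standard IR measure. It depends only on the position of the first correct item, so it gives the same score to a one-item list and to a longer list that starts with the same item. The function name and the example intents below are hypothetical, and this is a minimal sketch rather than any measure the authors propose.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical chatbot intents; only the first one is correct.
relevant = {"reset password"}
short_list = ["reset password"]
long_list = [
    "reset password",
    "change email",
    "close account",
    "billing help",
    "contact support",
]

rr_short = reciprocal_rank(short_list, relevant)
rr_long = reciprocal_rank(long_list, relevant)

# The two scores are identical even though the second list shows
# the user four extra, incorrect options.
print(rr_short, rr_long)
```

Because the score does not change with the number of extra options, a measure of this kind cannot reward a system for keeping its list short, which is the trade-off the abstract describes.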

## Abstract

In recent years, the proliferation of smart mobile devices has led to the gradual integration of search functionality within mobile platforms. This has created an incentive to move away from the "ten blue links" metaphor, as mobile users are less likely to click on them, expecting to get the answer directly from the snippets. In turn, this has revived the interest in Question Answering. Then, along came chatbots, conversational systems, and messaging platforms, where the user needs could be better served with the system asking follow-up questions in order to better understand the user's intent. While typically a user would expect a single response at any utterance, a system could also return multiple options for the user to select from, based on different system understandings of the user's intent. However, this possibility should not be overused, as this practice could confuse and/or annoy the user. How to produce good variable-length lists, given the conflicting objectives of staying short while maximizing the likelihood of having a correct answer included in the list, is an underexplored problem. It is also unclear how to evaluate a system that tries to do that. Here we aim to bridge this gap. In particular, we define some necessary and some optional properties that an evaluation measure fit for this purpose should have. We further show that existing evaluation measures from the IR tradition are not entirely suitable for this setup, and we propose novel evaluation measures that address it satisfactorily.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.10565/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/1905.10565/full.md

---
Source: https://tomesphere.com/paper/1905.10565