Evaluating Variable-Length Multiple-Option Lists in Chatbots and Mobile Search
Pepa Atanasova, Georgi Karadzhov, Yasen Kiprov, Preslav Nakov,, Fabrizio Sebastiani

TL;DR
This paper investigates how to generate and evaluate variable-length answer lists in chatbots and mobile search, aiming to balance informativeness and user experience, and introduces new evaluation metrics tailored for this task.
Contribution
It defines properties for evaluating variable-length answer lists, analyzes limitations of existing measures, and proposes novel evaluation metrics suited for chatbot and mobile search applications.
Findings
Existing IR evaluation measures are inadequate for variable-length lists in chatbots.
Proposed new evaluation metrics better capture the quality of answer lists.
Guidelines for producing optimal answer list lengths to improve user satisfaction.
Abstract
In recent years, the proliferation of smart mobile devices has lead to the gradual integration of search functionality within mobile platforms. This has created an incentive to move away from the "ten blue links'' metaphor, as mobile users are less likely to click on them, expecting to get the answer directly from the snippets. In turn, this has revived the interest in Question Answering. Then, along came chatbots, conversational systems, and messaging platforms, where the user needs could be better served with the system asking follow-up questions in order to better understand the user's intent. While typically a user would expect a single response at any utterance, a system could also return multiple options for the user to select from, based on different system understandings of the user's intent. However, this possibility should not be overused, as this practice could confuse and/or…
| Unranked Retrieval | Ranked Retrieval | |||||||||||||
| Result list | Gold | LAR | Gold | AP | APL | APs | RR | nDCG | nDCGL | RBP | RBPL | OLAR | ||
| c | 1 | 1.00 | 1.00 | 1.00 | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.000 |
| cw | 2 | 0.67 | 0.80 | 0.75 | 2 | (*) 1.00 | 0.83 | 0.83 | (*) 1.00 | (*) 1.00 | 0.92 | (*) 0.50 | 0.75 | 0.756 |
| wc | 2 | 0.67 | 0.80 | 0.75 | 3 | 0.50 | 0.58 | 0.58 | 0.50 | 0.63 | 0.69 | 0.25 | 0.50 | 0.744 |
| cww | 4 | 0.50 | 0.67 | 0.67 | 4 | (*) 1.00 | (*) 0.75 | (*) 0.75 | (*) 1.00 | (*) 1.00 | (*) 0.88 | (*) 0.50 | (*) 0.63 | 0.675 |
| wcw | 4 | 0.50 | 0.67 | 0.67 | 5 | (*) 0.50 | 0.50 | 0.50 | (*) 0.50 | (*) 0.63 | 0.65 | (*) 0.25 | 0.38 | 0.663 |
| wwc | 4 | 0.50 | 0.67 | 0.67 | 6 | 0.33 | 0.42 | 0.42 | 0.33 | 0.50 | 0.57 | 0.13 | 0.25 | 0.659 |
| cwww | 7 | 0.40 | 0.57 | 0.63 | 7 | (*) 1.00 | (*) 0.70 | (*) 0.70 | (*) 1.00 | (*) 1.00 | (*) 0.85 | (*) 0.50 | (*) 0.56 | 0.634 |
| wcww | 7 | 0.40 | 0.57 | 0.63 | 8 | (*) 0.50 | (*) 0.45 | (*) 0.45 | (*) 0.50 | (*) 0.63 | (*) 0.62 | (*) 0.25 | (*) 0.31 | 0.622 |
| wwcw | 7 | 0.40 | 0.57 | 0.63 | 9 | (*) 0.33 | 0.37 | 0.37 | (*) 0.33 | (*) 0.50 | 0.54 | (*) 0.13 | 0.19 | 0.618 |
| wwwc | 7 | 0.40 | 0.57 | 0.63 | 10 | 0.25 | 0.33 | 0.33 | 0.25 | 0.43 | 0.50 | 0.06 | 0.13 | 0.616 |
| cwwww | 11 | 0.33 | 0.50 | 0.60 | 11 | (*) 1.00 | (*) 0.67 | (*) 0.67 | (*) 1.00 | (*) 1.00 | (*) 0.83 | (*) 0.50 | (*) 0.53 | 0.610 |
| wcwww | 11 | 0.33 | 0.50 | 0.60 | 12 | (*) 0.50 | (*) 0.42 | (*) 0.42 | (*) 0.50 | (*) 0.63 | (*) 0.61 | (*) 0.25 | (*) 0.28 | 0.598 |
| wwcww | 11 | 0.33 | 0.50 | 0.60 | 13 | (*) 0.33 | (*) 0.33 | (*) 0.33 | (*) 0.33 | (*) 0.50 | (*) 0.52 | (*) 0.13 | (*) 0.16 | 0.594 |
| wwwcw | 11 | 0.33 | 0.50 | 0.60 | 14 | (*) 0.25 | 0.29 | 0.29 | (*) 0.25 | (*) 0.43 | 0.48 | (*) 0.06 | 0.09 | 0.591 |
| wwwwc | 11 | 0.33 | 0.50 | 0.60 | 15 | 0.20 | 0.27 | 0.27 | 0.20 | 0.39 | 0.46 | 0.03 | 0.06 | 0.590 |
| w | 16 | 0.00 | () 0.50 | 0.50 | 16 | 0.00 | 0.00 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.488 |
| ww | 17 | (*) 0.00 | 0.40 | 0.25 | 17 | (*) 0.00 | (*) 0.00 | 0.17 | (*) 0.00 | (*) 0.00 | (*) 0.00 | (*) 0.00 | (*) 0.00 | 0.244 |
| www | 18 | (*) 0.00 | 0.33 | 0.17 | 18 | (*) 0.00 | (*) 0.00 | 0.13 | (*) 0.00 | (*) 0.00 | (*) 0.00 | (*) 0.00 | (*) 0.00 | 0.163 |
| wwww | 19 | (*) 0.00 | 0.29 | 0.13 | 19 | (*) 0.00 | (*) 0.00 | 0.10 | (*) 0.00 | (*) 0.00 | (*) 0.00 | (*) 0.00 | (*) 0.00 | 0.122 |
| wwwww | 20 | (*) 0.00 | 0.25 | 0.10 | 20 | (*) 0.00 | (*) 0.00 | 0.08 | (*) 0.00 | (*) 0.00 | (*) 0.00 | (*) 0.00 | (*) 0.00 | 0.098 |
| Correctness | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | ||
| Confidence | No | Yes | Yes | No | No | No | No | No | No | No | No | Yes | ||
| Priority | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | ||
| Kendall’s Tau | 0.970 | 0.985 | 1 | 0.746 | 0.827 | 0.857 | 0.746 | 0.746 | 0.811 | 0.746 | 0.811 | 1 | ||
| Spearman correlation | 0.992 | 0.994 | 1 | 0.855 | 0.926 | 0.934 | 0.855 | 0.855 | 0.918 | 0.855 | 0.918 | 1 | ||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Evaluating Variable-Length Multiple-Option Lists
in Chatbots and Mobile Search
Pepa Atanasova
University of Copenhagen, Copenhagen, Denmark
,
Georgi Karadzhov, Yasen Kiprov
SiteGround Hosting EOOD, Sofia, Bulgaria
,
Preslav Nakov
Qatar Computing Research Institute, HBKU, Doha, Qatar
and
Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, Italy
(2019)
Abstract.
In recent years, the proliferation of smart mobile devices has lead to the gradual integration of search functionality within mobile platforms. This has created an incentive to move away from the “ten blue links” metaphor, as mobile users are less likely to click on them, expecting to get the answer directly from the snippets. In turn, this has revived the interest in Question Answering. Then, along came chatbots, conversational systems, and messaging platforms, where the user needs could be better served with the system asking follow-up questions in order to better understand the user’s intent. While typically a user would expect a single response at any utterance, a system could also return multiple options for the user to select from, based on different system understandings of the user’s intent. However, this possibility should not be overused, as this practice could confuse and/or annoy the user. How to produce good variable-length lists, given the conflicting objectives of staying short while maximizing the likelihood of having a correct answer included in the list, is an underexplored problem. It is also unclear how to evaluate a system that tries to do that. Here we aim to bridge this gap. In particular, we define some necessary and some optional properties that an evaluation measure fit for this purpose should have. We further show that existing evaluation measures from the IR tradition are not entirely suitable for this setup, and we propose novel evaluation measures that address it satisfactorily.
Chatbots, Mobile Search, Evaluation Measures.
††journalyear: 2019††copyright: acmlicensed††conference: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval; July 21–25, 2019; Paris, France††booktitle: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’19), July 21–25, 2019, Paris, France††price: 15.00††doi: 10.1145/3331184.3331308††isbn: 978-1-4503-6172-9/19/07
1. Introduction
Mobile devices have emerged as an essential and integral part of our lives. Yet, the limited size of mobile screens, and the consequently reduced amount of displayed information, have brought about new challenges in terms of user experience. The first step towards addressing them is to depart from the “ten blue links” metaphor and to actually understand the user’s information needs (Baeza-Yates et al., 2011; Russell-Rose and Tate, 2012). Moreover, chatbots have been introduced as a way to include a follow-up interaction with the user (Følstad and Brandtzæg, 2017).
Most chatbots in actual use (Coucke et al., 2018; Bocklisch et al., 2017; Burtsev and other 19 authors, 2018; Williams et al., 2015) share a common characteristic: they are capable of handling different types of user needs, usually called intents. As the chatbot might engage in an elaborate series of actions and utterances to fulfill the user’s intent, a high-accuracy intent detection module becomes a crucial component of such systems. Yet, as the number of intents being handled grows, the accuracy of the intent classifier tends to decrease, with reported values down to 0.73 and even to 0.52 (Braun et al., 2017; Coucke et al., 2018), depending on the domain and the number of intents considered.
In order to mitigate intent misclassification, personal assistants can use strategies such as continue with the most likely intent, ask for confirmation, return a list of possible intents, or repeat the question. Previous research has found that users prefer a list of the most likely intents, but also noted that this “complicates with clutter; unnatural; more reading” (Ashktorab et al., 2019), i.e., the list should be concise.
To this end, toolkits such as the IBM Watson Assistant111http://console.bluemix.net/docs/services/assistant/dialog-runtime.html and Oracle’s Digital Assistant222https://bit.ly/2XbW1Do provide functionality for defining confidence thresholds, which allow more candidate intents to be displayed when the model is not confident. This means less typing by the user and faster narrowing down the user’s request (Ashktorab et al., 2019).
The examples we present below show that going in the wrong direction might have very negative consequences. Instead, the system could present a list of highly likely options and then leave it to the user to select the correct one, e.g.,
[TABLE]
The system should be careful, though, not to suggest too many options, as this could confuse and/or annoy the user (Chaves and Gerosa, 2019). Moreover, it should try to make sure the list contains a good suggestion. Depending on the presentation mode, the order in which the options are presented might or might not matter.
Current research and development for chatbots moves in the direction of open-domain task-oriented systems, with an ever-increasing number of intents, which makes variable-length lists much more common. However, the evaluation of such systems remains an underexplored problem, and existing measures from the IR tradition do not fit this setup well enough. Therefore, we define a set of properties that an evaluation measure should satisfy in order to optimize two conflicting objectives simultaneously, i.e., (i) reduce the size of the list, and (ii) maximize the likelihood that a good option is indeed in the list. We further propose evaluation measures that satisfy all these desiderata. While here we focus on chatbots, most of the arguments we present apply to mobile search, too.
2. Assumptions and Desiderata
Our goal is to evaluate systems that, given a user question, try to understand the underlying intent and to answer with a suitable response. We assume that the question expresses a single user intent and therefore the system can return a single correct response. 333We leave the case of multiple possible correct answers for future work. We further assume that the system always returns a non-empty list of responses,444We assume a special default intent with a default answer to cover the case when the system cannot understand the intent. and that different responses correspond to different intents. We represent response lists as sequences of symbols from , where stands for a correct response and stands for a wrong one. Finally, we assume that the position of the correct answer (if returned) in the list may or may not matter, depending on the context. In other words, we cater for the fact that in some applications the results should be considered a plain unordered set, while in some others they should form a ranked list.
Next, we define a set of properties that an evaluation measure for variable-length lists should satisfy. Given two response lists and for the same user question, a property specifies which one should give a higher score. Note that we take to be a measure of accuracy, and not of error, i.e., higher values of are better. We use to refer to the response item at the -th position of . We further define a function that, given a response list , returns the number of responses of type (where ) that contains. Moreover, we define a function that, given a response list , returns the reciprocal rank of the correct response, or 0 if no correct response was returned. Finally, we define a re-scaling function , which re-scales from the range [0, 1] to the range [0, newMAX]. We use for the length of , and the symbol to express preference between two lists.
Property 1.
*(Correctness) If ,
then . ∎*
This property states that a response list that contains a correct response should be preferred to one that does not.
Property 2.
*(Confidence) If
and , then . ∎*
This property states that, if two lists contain the same number of correct responses (can be 0 or 1), the list with fewer wrong responses is preferable. The aim is to limit the length of the response list.
Property 3.
(Priority) If and and , then . ∎
This property states that if two lists both contain a correct response, then the list where the correct response is ranked higher should be preferred.
We view Correctness and Confidence as mandatory properties for all our measures, and Priority as an optional one, depending on whether the results from the chatbot application are presented as an unordered set or as a ranked list.
3. Evaluation Measures
In Table 1, we present an evaluation of existing information retrieval measures w.r.t. the introduced properties. The response lists are ranked according to the properties at hand, and together with the priority between the properties, we obtain a unique “gold ranking.”
Next, we compare the evaluation scores and the resulting rank order of the various measures with respect to the “gold ranking.” To this end, we estimate Kendall’s Tau and Spearman’s rank correlation between the two and we indicate the errors in the rank positions. We also provide information about the properties that each of the measures satisfies or violates.
Existing Measures for Unranked Retrieval. Considering evaluation of unordered sets, one obvious candidate is . We can see in Table 1 that is successful at rewarding the presence of the correct answer (due to the recall component) and, usually, at minimizing the length of the response list (due to the precision component). Unfortunately, its value is always 0 when there is no correct response, Thus, it fails to satisfy Confidence.
We try to solve the problem by smoothing (denoted as ), which we obtain by appending an extra correct response at the end of each list. The resulting measure does not suffer from the above problems of , but fails to distinguish between a list consisting of a single wrong response and lists with one correct and four wrong responses, thus failing to satisfy Correctness.
Existing Measures for Ranked Retrieval. A natural candidate measure for ranked retrieval is Average Precision (AP). It computes precision after each relevant response, thus satisfying both Correctness and Priority. However, it disregards the number of the returned irrelevant responses and even stops computing at the last relevant response, ignoring all the subsequent irrelevant responses, and thus it fails to satisfy Confidence.
In order to allow the length of the response list to influence the evaluation score, it has been proposed (Liu et al., 2016) to append a terminal response t at the end of each response list, which is called the measure. manages to alleviate some of the ranking problems observed in AP, but still fails to satisfy Confidence in some cases.
Another way to make AP penalize wrong responses at the end of the list is to use smoothing (denoted as ). Now, the more wrong responses we return at the end of the list, the lower the precision at the last recall level will get. As a result, manages to reduce further the number of errors in the ranking with respect to the gold order. Nevertheless, still does not satisfy Confidence in cases when the result lists have different numbers of wrong results and different positions of the correct responses, as in wcw and cwww. This is due to failing to apply the properties in the correct order, i.e., applying Priority before Confidence.
Reciprocal Rank (RR) is a popular measure for ranked retrieval, which accounts for the position of the 1st correct response, disregarding any following irrelevant responses. AP is equivalent to RR in the case of a single correct response (and so is DCG).
Furthermore, although normalized Discounted Cumulative Gain (nDCG) is designed specifically for the evaluation of different relevance scores, we study its performance on our “gold ranking.” However, it does not penalize wrong responses, violating Confidence. Using the technique proposed in (Liu et al., 2016), we end up with , which manages to reduce the number of ranking errors, but still violates Confidence in some cases.
(Moffat and Zobel, 2008) proposed Rank-Biased Precision (RBP), which models a user that decides to continue reading the next item in the response list with probability . As in (Liu et al., 2016), we set =0.5. RBP struggles with the same problems as RR and violates Confidence. Even if we apply the technique from (Liu et al., 2016), the ranking produced by contains a lot of errors compared to the “gold ranking.”
New Measures. The existing measures we have discussed are suitable for optimizing the number of correct responses and their positions. Most of them fulfill Correctness and Priority, but struggle with Confidence. This is not surprising as the IR tradition (except for recent work like (Albahem et al., 2018; Liu et al., 2016)) is concerned with the rank positions of the relevant documents, not with the length of the response list, which is conceptually infinite. Before (Albahem et al., 2018; Liu et al., 2016), this length had never been a parameter in any of the proposed measures.
Smoothing lists by appending an additional correct response is beneficial but not enough to achieve perfect correlation with the “gold ranking,” as the Kendall’s Tau and the Spearman rank correlation scores indicate. In order to bridge this gap, we introduce a new measure, Length-Aware Recall (LAR), which operates on a list of responses and gives preference to lists with fewer negatives:
[TABLE]
The new measure first computes the recall of the returned list. Then, it includes the confidence of the system about the correct result expressed by the reciprocal value of the list’s length, i.e., the confidence decreases when the number of returned responses increases. Taking the mean of the two scores, we create an intuitive score of both recall and confidence. The possibility of having zero values for recall makes arithmetic mean preferable to harmonic mean, also used to combine evaluation criteria.
As LAR satisfies both Correctness and Confidence, it can be used to evaluate variable-length lists by modeling the true positive rate and the optimal response length jointly. Moreover, it is perfectly correlated with the “gold ranking” in the unordered scenario.
However, LAR does not satisfy the Priority property, which makes it unfit for scenarios where order does matter. In order to fix this issue, we propose an extension of LAR that includes an additional third term for the rank of the correct response.
This fix gives rise to Ordered Length-Aware Recall (OLAR):
[TABLE]
The third term in the above equation accounts for Priority, and it is larger when the rank of the correct item moves lower in the list. We rescale it because of the priority order that we have defined for the properties – it should not contribute to the score more than the Confidence term. Given that is the smallest difference between two Priority scores of lists with length up to 5, we re-scale it in [0, ], where and is an insignificantly small number.
To sum up, we introduced two measures, which are in a perfect correlation with the two “gold rankings.” The measures consist of separate terms, accounting for different properties, which makes them easily interpretable and extensible for specific needs.
4. Related Work
Related properties of evaluation measures. As we define a set of properties that need to be satisfied by an evaluation measure for variable-length output, our work is closely related to the properties of truncated evaluation measures introduced in (Albahem et al., 2018). Their Relevance Monotonicity property is similar to our Correctness, except that Correctness imposes strict monotonicity. The Irrelevance Monotonicity property is similar to our Confidence, but it discounts for irrelevant documents at the end of the list only.
(Moffat, 2013) defined seven properties for effectiveness measures. The relevant properties, which our new evaluation measures also satisfy are Completeness, Realisability, Localisation and Boundedness. However, their Monotonicity property contradicts our Confidence property because it states that adding documents, which are not relevant, to the end of the list increases the score.
(Amigó et al., 2018; Sebastiani, 2015) also conducted an “axiomatic” analysis of IR-related evaluation measures. Our properties Confidence and Priority are akin to the ones discussed in (Amigó et al., 2013), but the latter are used in a different setup, i.e., for document retrieval, clustering, and filtering.
Related evaluation measures. In Section 3, we discussed and evaluated the most relevant evaluation measures for both unranked (precision, recall, , and smoothed ) and ranked retrieval (nDCG (Järvelin and Kekäläinen, 2000), RR, MAP, smoothed MAP, RBP (Liu et al., 2016)). We found that they were unable to penalize the wrong responses according to the “gold order,” thus violating Confidence.
Apart from these measures, (Peñas and Rodrigo, 2011) introduced the c@1 measure, a modification of accuracy, suited for systems that may not return responses. However, their approach still does not penalize the number of returned wrong responses at the end of the list. Furthermore, (Peñas and Rodrigo, 2011; Voorhees, 2001; Sakai, 2004) assumed that a system can return an empty result list, which tackles the problem when the request does not have a correct response. However, we assume that the system should return at least one result, even if it is a default fallback intent.
Another relevant field of research analyzes the likelihood that the user will continue exploring the response list based on different signals - time-biased gain (Smucker and Clarke, 2012), length of the snippet (Sakai and Dou, 2013), and information foraging (Azzopardi et al., 2018). However, in mobile search and chatbot platforms, the presented information is already minimized, and thus we aim to reduce the length of the returned results instead.
5. Conclusion and Future Work
We have studied the problem of evaluating variable-length response lists for chatbots, given the conflicting objectives of keeping these lists short while maximizing the likelihood of having a correct response in the list. In particular, we argued for three properties that such an evaluation measure should have. We further showed that existing evaluation measures from the IR tradition do not satisfy all of them. Then, we proposed novel measures that are especially tailored for the described scenarios, and satisfy all properties.
We plan to extend this work to the context in which more than one correct answer might exist, since a long and complex input question may contain multiple intents (Xu and Sarikaya, 2013). This would also be of interest for mobile retrieval in general, where results may need to be truncated due to limited screen space.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2Albahem et al . (2018) A. Albahem, D. Spina, F. Scholer, A. Moffat, and L. Cavedon. 2018. Desirable Properties for Diversity and Truncated Effectiveness Metrics. In Proc. of ADCS .
- 3Amigó et al . (2013) E. Amigó, J. Gonzalo, and F. Verdejo. 2013. A General Evaluation Measure for Document Organization Tasks. In Proceedings of SIGIR .
- 4Amigó et al . (2018) E. Amigó, D. Spina, and J. Carrillo de Albornoz. 2018. An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric. In Proceedings of SIGIR .
- 5Ashktorab et al . (2019) Z. Ashktorab, M. Jain, Q. V. Liao, and J. D Weisz. 2019. Resilient Chatbots: Repair Strategy Preferences for Conversational Breakdowns. In Proceedings of CHI .
- 6Azzopardi et al . (2018) L. Azzopardi, P. Thomas, and N. Craswell. 2018. Measuring the Utility of Search Engine Result Pages: An Information Foraging Based Measure. In Proc. of SIGIR .
- 7Baeza-Yates et al . (2011) R. Baeza-Yates, A. Z. Broder, and Y. Maarek. 2011. The New Frontier of Web Search Technology: Seven Challenges. In Search Computing . Springer, 3–9.
- 8Bocklisch et al . (2017) T. Bocklisch, J. Faulker, N. Pawlowski, and A. Nichol. 2017. Rasa: Open Source Language Understanding and Dialogue Management. (2017). ar Xiv:1712.05181.
