MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers
Nicole Cho, William Watson

TL;DR
MultiQ&A is a scalable, automated framework that assesses the robustness and consistency of large language models' answers by using independent LLM agents to crowdsource question perturbations and then analyzing the resulting answers at scale.
Contribution
We introduce MultiQ&A, a novel systematic approach for evaluating LLM robustness through automated crowdsourcing of question perturbations and answer analysis.
Findings
Analyzes 1.9 million question perturbations and 2.3 million answers.
Shows ensembled LLMs like gpt-3.5-turbo are relatively robust under perturbations.
Provides a framework for measuring confidence, consistency, and hallucinations in LLM responses.
Abstract
One critical challenge in the institutional adoption journey of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. We demonstrate MultiQ&A's ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments culminated in the examination of 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQ&A shows that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations. MultiQ&A provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Therefore, our system offers a potential framework for institutional LLM adoption with…
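To make the idea of consistency under perturbations concrete, here is a minimal sketch, not the authors' implementation. It assumes the answers to several perturbed versions of one question have already been collected as strings, ensembles them by majority vote, and reports the share of answers that agree with the majority. The names `normalize` and `ensemble_answer`, the normalization rule, and the example answers are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.strip().lower().split())


def ensemble_answer(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of answers that agree with it.

    `answers` holds one answer per perturbed version of the same question
    (an assumption of this sketch; the paper's pipeline may differ).
    """
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(normalize(a) for a in answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)


# Illustrative answers to four perturbations of one question (made-up data).
answers = ["Paris", "paris ", "Paris", "Lyon"]
majority, agreement = ensemble_answer(answers)
print(majority, agreement)
```

A low agreement score flags a question whose answers vary across perturbations, which is the kind of disagreement the abstract describes inspecting.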
Taxonomy
Topics: Advanced Text Analysis Techniques · Expert finding and Q&A systems
