MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers

Nicole Cho; William Watson

arXiv:2502.03711·cs.CL·February 7, 2025

TL;DR

MultiQ&A is a scalable, automated framework that assesses the robustness and consistency of large language models' answers by crowdsourcing question perturbations and analyzing their responses at scale.

Contribution

We introduce MultiQ&A, a novel systematic approach for evaluating LLM robustness through automated crowdsourcing of question perturbations and answer analysis.

Findings

01. Encompasses analysis of 1.9 million question perturbations and 2.3 million answers.

02. Shows ensembled LLMs like gpt-3.5-turbo are relatively robust under perturbations.

03. Provides a framework for measuring confidence, consistency, and hallucinations in LLM responses.

Abstract

One critical challenge in the institutional adoption journey of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. We demonstrate MultiQ&A's ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments culminated in the examination of 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQ&A shows that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations. MultiQ&A provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Therefore, our system offers a potential framework for institutional LLM adoption with…
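
As a rough illustration of the idea in the abstract (answer several perturbed versions of one question, then inspect where the answers disagree), the sketch below scores agreement by simple majority vote. It is not the authors' implementation: the function names, the string normalization, and the majority-vote agreement score are all assumptions made for this example, and the perturbation and answering steps are taken as already done.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal (assumed rule)."""
    return " ".join(answer.strip().lower().split())


def majority_answer(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the fraction of answers matching it."""
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(normalize(a) for a in answers)
    top, n = counts.most_common(1)[0]
    return top, n / len(answers)


def consistency_report(answers_by_perturbation: dict[str, list[str]]) -> dict:
    """Summarize agreement across all perturbations of one original question.

    Keys are perturbed question strings; values are the answers collected for each.
    """
    all_answers = [a for answers in answers_by_perturbation.values() for a in answers]
    consensus, agreement = majority_answer(all_answers)
    # Perturbations whose own majority answer differs from the overall consensus.
    disagreeing = [
        question
        for question, answers in answers_by_perturbation.items()
        if majority_answer(answers)[0] != consensus
    ]
    return {
        "consensus": consensus,
        "agreement": agreement,
        "disagreeing_questions": disagreeing,
    }


if __name__ == "__main__":
    # Hypothetical data: two perturbations of one question, three answers each.
    example = {
        "What is the capital of France?": ["Paris", "paris ", "Paris"],
        "Which city is France's capital?": ["Paris", "Lyon", "Lyon"],
    }
    print(consistency_report(example))
```

A low agreement score, or a non-empty list of disagreeing perturbations, flags a question whose answer is sensitive to rephrasing. Exact string matching is a deliberately crude stand-in; free-form answers would need a more tolerant comparison.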

Taxonomy

Topics: Advanced Text Analysis Techniques · Expert finding and Q&A systems