IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering

Ruosen Li; Ruochen Li; Barry Wang; Xinya Du

arXiv:2408.13545·cs.CL·November 19, 2024·3 cites

TL;DR

IQA-EVAL introduces an automatic, scalable framework using LLMs to simulate human interactions and evaluate large language models in interactive question answering, closely aligning with human judgments.

Contribution

The paper presents a novel LLM-based evaluation agent that simulates human behavior and interactions for automatic evaluation of LLMs in interactive QA tasks, improving scalability and correlation with human assessments.

Findings

1. High correlation between IQA-EVAL and human evaluations.
2. Persona assignment enhances evaluation accuracy.
3. Efficient evaluation of multiple LLMs on complex questions.

Abstract

To evaluate Large Language Models (LLMs) for question answering (QA), traditional methods typically focus on assessing single-turn responses to given questions. However, this approach does not capture the dynamic nature of human-AI interactions, where humans actively seek information through conversation. Recent work in human-computer interaction (HCI) has employed human evaluators to conduct interactions and evaluations, but such studies are often prohibitively expensive and time-consuming to scale. We introduce IQA-EVAL, an automatic evaluation framework for Interactive Question Answering Evaluation. More specifically, we introduce an LLM-based Evaluation Agent (LEA) that can (1) simulate human behaviors to generate interactions with IQA models and (2) automatically evaluate the generated interactions. Moreover, we propose assigning personas to LEAs to better simulate groups of real…
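The two LEA roles described in the abstract (generating an interaction, then evaluating it) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names, the `FINAL:` stop marker, the turn format, and the toy scoring fields are all hypothetical, and the three callables are stand-ins for real LLM calls.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]
ChatFn = Callable[[List[Turn]], str]  # hypothetical: maps a history to the next message


def simulate_interaction(question: str, lea: ChatFn, iqa_model: ChatFn,
                         persona: str = "", max_turns: int = 3) -> List[Turn]:
    """Stage 1 (assumed shape): the LEA plays the human and talks to the IQA model."""
    history: List[Turn] = []
    if persona:
        # Optional persona description, mirroring the persona idea in the abstract.
        history.append({"role": "persona", "text": persona})
    history.append({"role": "task", "text": question})
    for _ in range(max_turns):
        lea_msg = lea(history)
        history.append({"role": "lea", "text": lea_msg})
        # "FINAL:" is a made-up convention for "the LEA has settled on an answer".
        if lea_msg.startswith("FINAL:"):
            break
        history.append({"role": "iqa_model", "text": iqa_model(history)})
    return history


def evaluate_interaction(history: List[Turn],
                         judge: Callable[[List[Turn]], dict]) -> dict:
    """Stage 2 (assumed shape): the LEA scores the finished interaction."""
    return judge(history)


# Toy stand-ins so the sketch runs without any LLM API.
def toy_lea(history: List[Turn]) -> str:
    replies = [t["text"] for t in history if t["role"] == "iqa_model"]
    if replies:
        return "FINAL: " + replies[-1]
    task = next(t["text"] for t in history if t["role"] == "task")
    return "Can you help me answer: " + task


def toy_iqa_model(history: List[Turn]) -> str:
    return "Paris"


def toy_judge(history: List[Turn]) -> dict:
    # Hypothetical fields; the scoring dimensions used in the paper are not given here.
    lea_turns = sum(1 for t in history if t["role"] == "lea")
    final = history[-1]["text"].removeprefix("FINAL: ")
    return {"lea_turns": lea_turns, "final_answer": final}


history = simulate_interaction("What is the capital of France?",
                               toy_lea, toy_iqa_model, persona="curious novice")
scores = evaluate_interaction(history, toy_judge)
print(scores)
```

Keeping the simulation and evaluation stages as separate functions means the same recorded interaction could be scored by different judges or personas; whether the paper structures it this way is not stated in the abstract.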

Videos

IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering (SlidesLive)

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning

Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings