Evaluating Human-Language Model Interaction

Mina Lee; Megha Srivastava; Amelia Hardy; John Thickstun; Esin Durmus; Ashwin Paranjape; Ines Gerard-Ursin; Xiang Lisa Li; Faisal Ladhak; Frieda Rong; Rose E. Wang; Minae Kwon; Joon Sung Park; Hancheng Cao; Tony Lee; Rishi Bommasani; Michael Bernstein; Percy Liang

arXiv:2212.09746·cs.CL·January 9, 2024·45 cites

Open Access · 1 Repo

TL;DR

This paper introduces HALIE, a comprehensive framework for evaluating human-LM interaction that considers the interactive process, subjective experience, and preferences, revealing that non-interactive metrics often do not predict interactive performance.

Contribution

The paper develops HALIE, a novel evaluation framework for human-LM interaction, and demonstrates its effectiveness across diverse tasks and models, highlighting discrepancies with traditional non-interactive metrics.

Findings

01

Better non-interactive performance does not always mean better human-LM interaction

02

Interactive and non-interactive metrics can yield diverging evaluation results

03

Human-centric evaluation captures aspects missed by standard benchmarks

Abstract

Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and…

Code & Models

Repositories

stanford-crfm/halie
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Cosine Annealing · Dropout · Layer Normalization