Evaluating Human-Language Model Interaction
Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, Percy Liang

TL;DR
This paper introduces HALIE, a comprehensive framework for evaluating human-LM interaction that considers the interactive process, subjective experience, and preferences, revealing that non-interactive metrics often do not predict interactive performance.
Contribution
The paper develops HALIE, a novel evaluation framework for human-LM interaction, and demonstrates its effectiveness across diverse tasks and models, highlighting discrepancies with traditional non-interactive metrics.
Findings
Better non-interactive performance does not always mean better human-LM interaction
Interactive and non-interactive metrics can diverge in evaluation results
Human-centric evaluation captures aspects missed by standard benchmarks
Abstract
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Cosine Annealing · Dropout · Layer Normalization
