From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Itay Itzhak; Eliya Habba; Gabriel Stanovsky; Yonatan Belinkov

arXiv:2604.14137·cs.CL·April 17, 2026

TL;DR

This paper formalizes "vibe-testing", the informal, experience-based way users evaluate LLMs, and shows that personalizing both the prompts and the judging criteria can change which model is preferred, helping bridge the gap between benchmark scores and real-world use.

Contribution

The paper analyzes how users vibe-test in practice, formalizes vibe-testing as a two-part process (personalizing what is tested and how responses are judged), and presents a proof-of-concept pipeline that personalizes evaluation to better reflect real-world preferences.

Findings

01 Personalized prompts and subjective evaluation criteria can alter model preferences.

02 Vibe-testing practices vary widely among users and contexts.

03 Formalizing vibe-testing helps connect benchmark results with real-world usefulness.

Abstract

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating…
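To make the two-part formulation concrete, here is a minimal Python sketch. It is not the authors' pipeline: the names `UserProfile`, `judge`, and `compare`, the toy models, and the two scoring criteria are all hypothetical, chosen only to illustrate how personalizing what is tested (a prompt rewriter) and how responses are judged (weighted criteria) can flip a pairwise model preference.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class UserProfile:
    """Hypothetical user description covering both parts of the formulation."""
    rewrite_prompt: Callable[[str], str]         # personalizes WHAT is tested
    criteria: Dict[str, Callable[[str], float]]  # personalizes HOW responses are judged
    weights: Dict[str, float]                    # relative importance of each criterion

def judge(response: str, profile: UserProfile) -> float:
    """Weighted average of the user's criteria, each scored in [0, 1]."""
    total = sum(profile.weights.values())
    return sum(profile.weights[name] * fn(response)
               for name, fn in profile.criteria.items()) / total

def compare(model_a, model_b, base_prompts: List[str], profile: UserProfile) -> Dict[str, int]:
    """Count pairwise wins for two models on personalized prompts."""
    wins = {"a": 0, "b": 0, "tie": 0}
    for base in base_prompts:
        prompt = profile.rewrite_prompt(base)
        score_a = judge(model_a(prompt), profile)
        score_b = judge(model_b(prompt), profile)
        if score_a > score_b:
            wins["a"] += 1
        elif score_b > score_a:
            wins["b"] += 1
        else:
            wins["tie"] += 1
    return wins

# Toy stand-ins for real models (assumption: a model is any callable str -> str).
def terse_model(prompt: str) -> str:
    return "def add(a, b): return a + b"

def verbose_model(prompt: str) -> str:
    return ("# Adds two numbers and returns the sum of both\n"
            "def add(a, b):\n"
            "    # plain addition\n"
            "    return a + b  # done")

# Illustrative criteria only; real subjective criteria would be far richer.
CRITERIA = {
    "brevity": lambda r: 1.0 if len(r.split()) <= 12 else 0.0,
    "comments": lambda r: 1.0 if "#" in r else 0.0,
}

beginner = UserProfile(lambda p: p + " Explain it for a beginner.",
                       CRITERIA, {"brevity": 0.2, "comments": 0.8})
expert = UserProfile(lambda p: p + " Keep it minimal.",
                     CRITERIA, {"brevity": 0.8, "comments": 0.2})

beginner_result = compare(terse_model, verbose_model, ["Write an add function."], beginner)
expert_result = compare(terse_model, verbose_model, ["Write an add function."], expert)
```

With the same two toy models and the same base prompt, the beginner profile prefers the commented answer while the expert profile prefers the terse one, which is the kind of preference shift the findings describe.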
