From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov

TL;DR
This paper formalizes "vibe-testing", the informal, experience-based way users evaluate LLMs, and shows that personalized prompts and subjective criteria can change which model a user prefers, helping connect benchmark results with real-world use.
Contribution
The paper analyzes how users vibe-test in practice, introduces a formal framework for vibe-testing, and presents a prototype pipeline that personalizes evaluation to better reflect real-world preferences.
Findings
Personalized prompts and subjective evaluation criteria can alter model preferences.
Vibe-testing practices vary widely among users and contexts.
Formalizing vibe-testing helps connect benchmark results with real-world usefulness.
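The first finding can be illustrated with a minimal sketch. It is not the authors' pipeline: the `UserProfile` class, the criterion names, the weights, and all scores below are hypothetical, chosen only to show how personalizing both what is tested (a prompt rewrite) and how responses are judged (criterion weights) can flip which model a user prefers.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class UserProfile:
    """Hypothetical user profile covering both halves of personalization."""
    name: str
    rewrite_prompt: Callable[[str], str]  # what to test: adapts a base prompt
    criteria_weights: Dict[str, float]    # how to judge: weight per criterion


def personalized_score(profile: UserProfile, criterion_scores: Dict[str, float]) -> float:
    """Weighted average of per-criterion scores under this user's weights."""
    total = sum(profile.criteria_weights.values())
    weighted = sum(
        weight * criterion_scores.get(criterion, 0.0)
        for criterion, weight in profile.criteria_weights.items()
    )
    return weighted / total


def preferred_model(profile: UserProfile, scores_by_model: Dict[str, Dict[str, float]]) -> str:
    """Return the model with the highest personalized score."""
    return max(scores_by_model, key=lambda m: personalized_score(profile, scores_by_model[m]))


# Made-up per-criterion scores for two hypothetical models.
scores = {
    "model_a": {"correctness": 0.9, "conciseness": 0.4},
    "model_b": {"correctness": 0.7, "conciseness": 0.9},
}

novice = UserProfile(
    name="novice",
    rewrite_prompt=lambda p: p + " Explain each step.",
    criteria_weights={"correctness": 0.8, "conciseness": 0.2},
)
expert = UserProfile(
    name="expert",
    rewrite_prompt=lambda p: p + " Be brief; skip the basics.",
    criteria_weights={"correctness": 0.3, "conciseness": 0.7},
)

for user in (novice, expert):
    print(user.name, "prefers", preferred_model(user, scores))
```

With these invented numbers, the two users rank the same pair of models in opposite order, even though the underlying per-criterion scores are identical. In a real setting, the scores would come from judging model responses to each user's rewritten prompts rather than from a fixed table.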
Abstract
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating…
