VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?

Srijan Bansal; Jiao Fangkai; Yilun Zhou; Austin Xu; Shafiq Joty; Semih Yavuz

arXiv:2603.15921 · cs.SE · March 18, 2026

TL;DR

This paper introduces VIBEPASS, an empirical framework for evaluating large language models' ability to generate diagnostic tests and repair code faults, revealing current limitations in fault reasoning despite high syntactic test validity.

Contribution

It presents the first systematic evaluation of models' fault-triggering test generation and fault-targeted repair capabilities, showing that the bottleneck lies in fault reasoning rather than in code generation.

Findings

01. Models produce valid tests at high rates but struggle with discriminative fault detection.

02. Fault hypothesis generation is the main bottleneck, not test validity.

03. Self-generated tests can effectively guide repairs when faults are witnessed.

Abstract

As Large Language Models shift programming toward human-guided "vibe coding", agentic coding tools increasingly rely on models to self-diagnose and repair their own subtle faults, a capability central to autonomous software engineering yet never systematically evaluated. We present VIBEPASS, the first empirical decomposition that jointly evaluates two coupled tasks: Fault-Triggering Test Generation (FT-Test), constructing a discriminative witness that exposes a latent bug, and Fault-targeted Program Repair (FPR), repairing it under varying diagnostic conditions. VIBEPASS pairs competitive programming problems with LLM-generated solutions that pass partial test suites but fail on semantic edge cases, enabling controlled identification of where the diagnostic chain breaks down. Evaluating 12 frontier LLMs, we find that fault-targeted reasoning does not scale with…
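
To make the idea of a "discriminative witness" concrete, here is a minimal sketch. It is an illustration only, not code from the paper or the dataset: the maximum-subarray problem, both solution functions, the `partial_suite` inputs, and the `is_discriminative` helper are all hypothetical. It shows a faulty solution that passes a partial test suite yet fails on a semantic edge case, and an input counts as fault-triggering only if the reference and the faulty solution disagree on it.

```python
from typing import Callable, Sequence


def reference_max_subarray(nums: Sequence[int]) -> int:
    """Reference solution (hypothetical): maximum sum of a non-empty subarray."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)   # either extend the current run or restart at x
        best = max(best, cur)
    return best


def buggy_max_subarray(nums: Sequence[int]) -> int:
    """Faulty solution (hypothetical): starts from 0, so it silently
    allows the empty subarray and returns 0 for all-negative inputs."""
    best = cur = 0
    for x in nums:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def is_discriminative(
    test_input: Sequence[int],
    reference: Callable[[Sequence[int]], int],
    candidate: Callable[[Sequence[int]], int],
) -> bool:
    """A test input witnesses the fault only if the two outputs differ."""
    return reference(test_input) != candidate(test_input)


# A partial suite (assumed for illustration) that the faulty solution passes.
partial_suite = [[1, 2, 3], [-1, 4, -2, 5], [2, -1, 2]]

# A valid input that also exposes the latent bug: every element is negative.
fault_triggering_input = [-3, -1, -2]

if __name__ == "__main__":
    for case in partial_suite:
        print(case, is_discriminative(case, reference_max_subarray, buggy_max_subarray))
    print(
        fault_triggering_input,
        is_discriminative(fault_triggering_input, reference_max_subarray, buggy_max_subarray),
    )
```

In this sketch every input in `partial_suite` is a valid test but none is discriminative, which mirrors the distinction the findings draw between test validity and fault detection.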

Code & Models

Datasets

Salesforce/vibepass
dataset · 40 downloads

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Machine Learning and Algorithms