ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models
Sidong Feng, Dingbang Wang, Nikola Tomic, Tingting Yu, Aldeida Aleti, Chunyang Chen

TL;DR
ViBR is an automated system that uses vision-language models to reproduce bugs from GUI videos, improving accuracy over previous methods and reducing setup complexity.
Contribution
ViBR introduces a fully automated approach that combines CLIP embeddings and VLMs to reproduce bugs from videos, eliminating the need for app-specific instrumentation.
Findings
Successfully reproduces 72% of bug recordings
Outperforms state-of-the-art baselines
Significantly improves bug reproduction accuracy
Abstract
Bug reports play a critical role in software maintenance by helping users convey encountered issues to developers. Recently, GUI screen capture videos have gained popularity as a bug reporting artifact due to their ease of use and ability to retain rich contextual information. However, automatically reproducing bugs from such recordings remains a significant challenge. Existing methods often rely on fragile image-processing heuristics, explicit touch indicators, or pre-constructed UI transition graphs, which require non-trivial instrumentation and app-specific setup. This paper presents ViBR, a lightweight and fully automated approach that reproduces bugs directly from GUI recordings. Specifically, ViBR combines CLIP-based embedding similarity for action boundary segmentation with Vision-Language Models (VLMs) for region-aware GUI state comparison and guided bug replay. Experimental…
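The abstract's first step, embedding-similarity segmentation, can be illustrated with a minimal sketch. It assumes each video frame has already been turned into an embedding vector (for example by a CLIP image encoder) and marks an action boundary wherever the cosine similarity between consecutive frames falls below a threshold. The function names, the toy 2-D vectors, and the 0.9 threshold are hypothetical choices for illustration, not ViBR's actual procedure or parameters.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_action_boundaries(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the paper
) -> List[int]:
    """Return indices i where frame i differs sharply from frame i - 1."""
    boundaries = []
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            boundaries.append(i)
    return boundaries


# Toy stand-ins for per-frame embeddings: two near-identical frames,
# then an abrupt change to a different screen that stays stable.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
boundaries = segment_action_boundaries(frames)
print(boundaries)
```

In this toy input only the jump between the second and third frame crosses the threshold, so a single boundary is reported at index 2. A real pipeline would likely need smoothing or a minimum segment length to cope with animations and scrolling, which this sketch omits.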
