Automated Bug Frame Retrieval from Gameplay Videos Using Vision-Language Models
Wentao Lu, Alexander Senchenko, Abram Hindle, Cor-Paul Bezemer

TL;DR
This paper presents an automated pipeline that extracts keyframes from gameplay videos and uses vision-language models to match them with bug descriptions, significantly reducing manual review effort in game bug triage.
Contribution
The authors introduce a novel automated method combining keyframe extraction and vision-language models for efficient bug report verification in gaming.
Findings
Achieves an F1 score of 0.79 and accuracy of 0.89 in matching bug frames.
Keyframe extraction captures the bug moment in 98.79% of cases while keeping a median of only 1.90% of the original frames.
Performs best in Lighting & Shadow, Physics & Collision, and UI & HUD categories.
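For readers less familiar with the two headline metrics, the sketch below shows how accuracy and F1 are conventionally computed from confusion-matrix counts. It uses the standard textbook definitions; this summary does not describe the paper's exact evaluation protocol, and the counts in the usage note are hypothetical and are not results from the paper.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all decisions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

With hypothetical counts tp=8, fp=2, fn=2, tn=8, both functions return 0.8. F1 ignores true negatives, so the two metrics can diverge when negatives are common, which is one way a system can report a lower F1 than accuracy.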
Abstract
Modern game studios deliver new builds and patches at a rapid pace, generating thousands of bug reports, many of which embed gameplay videos. To verify and triage these bug reports, developers must watch the submitted videos. This manual review is labour-intensive, slow, and hard to scale. In this paper, we introduce an automated pipeline that reduces each video to a single frame that best matches the reported bug description, giving developers instant visual evidence that pinpoints the bug. Our pipeline begins with FFmpeg for keyframe extraction, reducing each video to a median of just 1.90% of its original frames while still capturing bug moments in 98.79% of cases. These keyframes are then evaluated by a vision-language model (GPT-4o), which ranks them based on how well they match the textual bug description and selects the most representative frame. We evaluated this approach…
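The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the FFmpeg flags used, so the command below uses one common way to keep only I-frames; `keyframe_command`, `select_best_frame`, and the `score` callable are hypothetical names, and `score` stands in for whatever prompt and API call the authors make to GPT-4o.

```python
from pathlib import Path
from typing import Callable, Sequence


def keyframe_command(video: str, out_dir: str) -> list[str]:
    """Build an FFmpeg command that writes only I-frames (keyframes) as PNGs.

    Assumption: the paper's exact flags are not stated; this is a common recipe.
    Newer FFmpeg builds prefer `-fps_mode vfr` over `-vsync vfr`.
    """
    return [
        "ffmpeg", "-i", video,
        "-vf", "select='eq(pict_type,I)'",  # keep frames whose picture type is I
        "-vsync", "vfr",                    # do not duplicate frames to fill gaps
        str(Path(out_dir) / "frame_%05d.png"),
    ]


def select_best_frame(
    frames: Sequence[str],
    description: str,
    score: Callable[[str, str], float],
) -> str:
    """Return the frame with the highest score for the bug description.

    `score(frame_path, description)` is a hypothetical stand-in for a
    vision-language model call that rates how well a frame matches the text.
    """
    if not frames:
        raise ValueError("no keyframes to rank")
    return max(frames, key=lambda frame: score(frame, description))
```

The command list can be passed to `subprocess.run(cmd, check=True)` once FFmpeg is installed. Keeping the scorer as a plain callable means the ranking step can be tried with a dummy function before any model is wired in.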
