VideoGameBunny: Towards vision assistants for video games
Mohammad Reza Taesiri, Cor-Paul Bezemer

TL;DR
VideoGameBunny is a multimodal model specialized for understanding and analyzing video game images. Trained on a tailored dataset, it outperforms larger models and is intended to support applications such as game playing, commentary, and debugging.
Contribution
The paper introduces VideoGameBunny, a LLaVA-style multimodal model based on Bunny and trained on a dataset of 185,259 video game images from 413 titles. It outperforms larger existing models despite having fewer parameters.
Findings
High-quality game data enhances model performance.
Small models can outperform larger models with specialized training.
The dataset and model facilitate future research in game understanding.
Abstract
Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements of 136,974 images. Our experiments show that our high quality game-related data…
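To put the dataset figures quoted in the abstract in proportion, the short sketch below derives a few ratios from them. Only the four counts come from the abstract; the variable names are illustrative, and the ratios are simple averages that say nothing about how pairs are actually distributed across images or titles.

```python
# Counts taken from the abstract.
num_images = 185_259            # video game images
num_titles = 413                # distinct game titles
num_instruction_pairs = 389_565 # image-instruction pairs (captions, QA, JSON)
num_json_images = 136_974       # images with a 16-element JSON representation

# Derived averages (illustrative only; the real distribution may be uneven).
images_per_title = num_images / num_titles
pairs_per_image = num_instruction_pairs / num_images
json_fraction = num_json_images / num_images

print(f"Average images per title: {images_per_title:.1f}")
print(f"Average instruction pairs per image: {pairs_per_image:.2f}")
print(f"Share of images with a JSON representation: {json_fraction:.1%}")
```

On these numbers, the dataset averages roughly two instruction pairs per image, and about three quarters of the images carry a JSON representation.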
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Video Surveillance and Tracking Methods
