Do Vision Language Models Understand Human Engagement in Games?
Ziyi Wang, Qizan Guo, Rishitosh Singh, Xiyang Hu

TL;DR
This study evaluates vision-language models' ability to infer human engagement from gameplay videos, revealing current models' limitations in understanding complex psychological states despite recognizing visual cues.
Contribution
The paper systematically evaluates three VLMs under six prompting strategies for engagement prediction across nine first-person shooter games, covering both pointwise and pairwise prediction.
Findings
Zero-shot predictions are generally weak and often fail to outperform simple per-game majority-class baselines.
Memory-augmented prompts improve pointwise prediction in some cases.
Pairwise engagement prediction remains challenging across strategies.
Abstract
Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision-language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction…
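As an illustration of the per-game majority-class baseline mentioned in the abstract, the following is a minimal sketch and not the authors' code. It assumes each sample is a hypothetical `(game, label)` pair with categorical engagement labels such as `"high"` and `"low"`; the paper's actual label scheme and data split are not given here. For brevity the majority class is computed on the same samples it is scored on, whereas a real evaluation would normally estimate it on a separate training split.

```python
from collections import Counter, defaultdict


def per_game_majority_baseline(samples):
    """Return {game: (majority_label, accuracy)} for a list of (game, label) pairs.

    The (game, label) format and the label values are assumptions made
    for illustration; they are not taken from the paper.
    """
    # Group the engagement labels by game.
    labels_by_game = defaultdict(list)
    for game, label in samples:
        labels_by_game[game].append(label)

    baseline = {}
    for game, labels in labels_by_game.items():
        # Always predicting the most frequent label gives this accuracy.
        majority_label, count = Counter(labels).most_common(1)[0]
        baseline[game] = (majority_label, count / len(labels))
    return baseline


# Hypothetical toy data: two games with made-up labels.
toy_samples = [
    ("game_a", "high"),
    ("game_a", "high"),
    ("game_a", "low"),
    ("game_b", "low"),
]
baseline = per_game_majority_baseline(toy_samples)
```

A model's per-game accuracy can then be compared against `baseline[game][1]`; if it does not exceed that value, the model adds nothing over always predicting the most frequent label for that game.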
Taxonomy
Topics: Artificial Intelligence in Games · Media Influence and Health · Digital Games and Media
