Can Large Language Models Capture Video Game Engagement?
David Melhart, Matthew Barthet, Georgios N. Yannakakis

TL;DR
This study evaluates the ability of large language models to predict human affect and engagement in video game footage, revealing their strengths and limitations in multimodal emotion recognition tasks.
Contribution
First comprehensive evaluation of LLMs for multimodal affect prediction in video game footage, analyzing factors affecting performance and highlighting future research directions.
Findings
LLMs outperform traditional baselines in some domains
Performance varies significantly across different games
LLMs generally lag behind human annotations in continuous affect labeling
Abstract
Can out-of-the-box pretrained Large Language Models (LLMs) detect human affect successfully when observing a video? To address this question, for the first time, we comprehensively evaluate the capacity of popular LLMs to predict continuous affect annotations of videos when prompted by a sequence of text and video frames in a multimodal fashion. In this paper, we test LLMs' ability to correctly label changes of in-game engagement in 80 minutes of annotated videogame footage from 20 first-person shooter games of the GameVibe corpus. We run over 4,800 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing method on engagement prediction. Our findings suggest that while LLMs rightfully claim human-like performance across multiple domains and are able to outperform traditional machine learning…
Taxonomy
Topics: Digital Games and Media · Artificial Intelligence in Games · Sports Analytics and Performance
