Leveraging Vision-Language Models to Detect Attention in Educational Videos
Gabriel Becquet (LIP6, CNRS, SU), Sébastien Lallé (CNRS, LIP6, SU), Vanda Luengo (LIP6, CNRS, SU), Ali Abou-Hassan (SU, CNRS, PHENIX, IUF)

TL;DR
This paper explores using vision-language foundation models to detect learner attention in educational videos from eye-tracking data, aiming to improve on classical machine learning classifiers trained on engineered gaze features, but finds that current models do not outperform statistical baselines.
Contribution
It introduces a novel VLM-based methodology for analyzing gaze data in educational videos, highlighting its potential and current limitations.
Findings
The VLM approach did not outperform statistical baselines.
Using Gemini 3 with various prompts was ineffective.
Insights into the limitations of foundation models for real-time attention detection.
Abstract
Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has so far been based on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze…
