Leveraging Vision-Language Models to Detect Attention in Educational Videos

Gabriel Becquet (LIP6, CNRS, SU); Sébastien Lallé (CNRS, LIP6, SU); Vanda Luengo (LIP6, CNRS, SU); Ali Abou-Hassan (SU, CNRS, PHENIX, IUF)

arXiv:2605.20211·cs.CV·May 21, 2026

TL;DR

This paper explores using vision-language foundation models to detect learner attention in educational videos, aiming to improve over traditional eye-tracking methods, but finds current models do not outperform statistical baselines.

Contribution

The paper introduces a novel VLM-based methodology for analyzing learners' gaze data in educational videos, and highlights both its potential and its current limitations.

Findings

01

The VLM approach did not outperform the statistical baselines.

02

Using Gemini 3 with a variety of prompts did not yield effective attention detection.

03

The study offers insights into the limitations of foundation models for real-time attention detection.

Abstract

Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. So far, such detection has relied on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, and thus exhibit only moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze…
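
To make the "engineered features" mentioned in the abstract concrete, here is a minimal sketch of summary statistics computed over one window of fixations. It is an illustration only, not the authors' pipeline: the `Fixation` record, the `summary_features` function, the chosen statistics, and the approximation of saccade amplitude as the distance between consecutive fixations are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import hypot
from statistics import mean, pstdev


@dataclass
class Fixation:
    # Hypothetical record: screen coordinates (pixels) and duration (ms).
    x: float
    y: float
    duration_ms: float


def summary_features(fixations):
    """Summary statistics over one window of fixations (illustrative feature set)."""
    if not fixations:
        raise ValueError("need at least one fixation")
    durations = [f.duration_ms for f in fixations]
    # Assumption: saccade amplitude is approximated by the Euclidean
    # distance between consecutive fixation positions.
    amplitudes = [
        hypot(b.x - a.x, b.y - a.y)
        for a, b in zip(fixations, fixations[1:])
    ]
    return {
        "fixation_count": len(fixations),
        "mean_fixation_ms": mean(durations),
        "std_fixation_ms": pstdev(durations),
        "mean_saccade_amplitude": mean(amplitudes) if amplitudes else 0.0,
        "max_saccade_amplitude": max(amplitudes, default=0.0),
    }
```

A feature dictionary like this, computed per time window, is the kind of fixed-length input a classical classifier would be trained on. Because each window is collapsed into a handful of numbers, the order and timing of gaze events within it are lost, which is one way to read the abstract's remark about temporal structure.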
