Learning in Focus: Detecting Behavioral and Collaborative Engagement Using Vision Transformers

Sindhuja Penchala; Saketh Reddy Kontham; Prachi Bhattacharjee; Nima Mahmoodi; Daniel Fonseca; Sareh Karami; Mehdi Ghahremani; Andy D. Perkins; Shahram Rahimi; and Noorbakhsh Amiri Golilarz

arXiv:2508.15782·q-bio.NC·December 16, 2025

TL;DR

This paper shows that Vision Transformers can classify children's behavioral and collaborative engagement in educational settings from visual cues. The Swin Transformer performed best, reaching 97.58% accuracy, which points toward scalable automated engagement analysis.

Contribution

It introduces a transformer-based approach for detecting behavioral and collaborative engagement in children using visual cues, outperforming existing models.

Findings

01 Swin Transformer achieved 97.58% accuracy in engagement classification.

02 Transformer models effectively capture local and global visual cues.

03 The approach shows promise for real-world educational engagement analysis.

Abstract

In early childhood education, accurately detecting collaborative and behavioral engagement is essential to foster meaningful learning experiences. This paper presents an AI-driven approach that leverages Vision Transformers (ViTs) to automatically classify children's engagement using visual cues such as gaze direction, interaction, and peer collaboration. Utilizing the ChildPlay gaze dataset, our method is trained on annotated video segments to classify behavioral and collaborative engagement states (e.g., engaged, not engaged, collaborative, not collaborative). We evaluated six state-of-the-art transformer models: Vision Transformer (ViT), Data-efficient Image Transformer (DeiT), Swin Transformer, VitGaze, APVit, and GazeTR. Among these, the Swin Transformer achieved the highest classification performance with an accuracy of 97.58%, demonstrating its effectiveness in modeling…
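To make the reported metric concrete, here is a minimal sketch of how classification accuracy (and, as an extra, per-label recall) could be computed over annotated segments. The label strings come from the abstract's examples; whether the paper treats them as one four-way task or two binary tasks is not stated here, so the single label set below is an assumption. The helper names and the toy annotations are hypothetical and are not data or code from the paper.

```python
from collections import Counter

# Label names taken from the abstract's examples. Treating them as one
# flat label set is an assumption made only for this illustration.
LABELS = ("engaged", "not engaged", "collaborative", "not collaborative")


def accuracy(y_true, y_pred):
    """Fraction of segments whose predicted label equals the annotation."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Per-label recall: correct predictions for a label / annotations of it."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy, made-up annotations and predictions (NOT results from the paper).
y_true = ["engaged", "engaged", "not engaged", "collaborative",
          "not collaborative", "collaborative", "engaged", "not engaged"]
y_pred = ["engaged", "engaged", "not engaged", "collaborative",
          "collaborative", "collaborative", "engaged", "engaged"]

print(f"accuracy = {accuracy(y_true, y_pred):.2%}")
print(per_class_recall(y_true, y_pred))
```

Per-label recall is included because a single accuracy figure can hide weak performance on a rare class; the abstract itself reports only overall accuracy.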
