Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors

Likai Peng; Shihui Feng

arXiv:2604.03631·cs.AI·April 7, 2026

TL;DR

This paper compares single-agent and multi-agent Vision Language Model (VLM) frameworks for automated coding of collaborative learning behaviors in screen recordings, and reports that the multi-agent systems outperform single VLMs.

Contribution

It proposes and evaluates two multi-agent frameworks for automated video coding that improve accuracy over single VLMs in educational research contexts.

Findings

01 Multi-agent frameworks outperform single VLMs in scene detection.

02 Workflow-based MAS excels in scene segmentation accuracy.

03 Autonomous-decision MAS achieves superior action detection.

Abstract

On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students' cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of leading closed-source VLMs (Claude-3.7-Sonnet, GPT-4.1) and an open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks: 1) a three-agent workflow multi-agent system (MAS) that segments screen videos by scene and detects on-screen behaviors using…
