Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
Likai Peng, Shihui Feng

TL;DR
This paper compares single-agent and multi-agent Vision Language Model frameworks for automated analysis of collaborative learning behaviors in videos, showing multi-agent systems outperform individual models.
Contribution
The paper introduces and evaluates two novel multi-agent frameworks for automated video coding that improve accuracy over single VLMs in educational research contexts.
Findings
Multi-agent frameworks outperform single VLMs in scene detection.
Workflow-based MAS excels in scene segmentation accuracy.
Autonomous-decision MAS achieves superior action detection.
Abstract
On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students' cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of leading closed-source VLMs (Claude-3.7-Sonnet, GPT-4.1) and an open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks: 1) a three-agent workflow multi-agent system (MAS) that segments screen videos by scene and detects on-screen behaviors using…
