Beyond Play and Pause: Turning GPT-4o Spatial Weakness into a Strength for In-Depth Interactive Video Learning
Sajad Goudarzi, Samaneh Zamanifard

TL;DR
This paper presents Untwist, an AI system that turns passive video learning into an interactive experience: users ask questions about a specific region of the video and receive region-specific responses. It works around GPT-4o's weakness in spatial reasoning by sending annotated frames, which improves accuracy.
Contribution
The paper introduces Untwist, a novel system that integrates GPT APIs with computer vision to enable real-time, region-specific video interaction. It compensates for GPT-4o's spatial limitations by using annotated frames.
Findings
Enhanced accuracy in localizing video content using annotated frames
Successful integration of GPT APIs with computer vision for interactive responses
Potential to significantly increase engagement and comprehension in video learning
Abstract
Traditional video-based learning remains passive, offering limited opportunities for users to engage dynamically with content. While current AI-powered tools offer transcription and summarization, they lack real-time, region-specific interaction capabilities. This paper introduces Untwist, an AI-driven system that enables interactive video learning by allowing users to ask questions about the entire video or, using a bounding box, about specific regions, and to receive context-aware, multimodal responses. By integrating GPT APIs with computer vision techniques, Untwist extracts, processes, and structures video content to enhance comprehension. Our approach addresses GPT-4o's spatial weakness by leveraging annotated frames instead of raw coordinate data, significantly improving accuracy in localizing and interpreting video content. This paper describes the system architecture, including video…
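The idea of sending an annotated frame rather than raw coordinates can be illustrated with a minimal sketch. This is not the authors' implementation: the function name annotate_frame, the representation of a frame as nested lists of RGB tuples, and the box format (left, top, right, bottom) are all assumptions made for illustration. The sketch only draws the user's bounding box onto a copy of the frame; in a real system the annotated image would then be encoded and passed to the vision model together with the question.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue), each 0-255
Frame = List[List[Pixel]]      # rows of pixels; hypothetical stand-in for a decoded video frame
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive pixel coordinates


def annotate_frame(frame: Frame, box: Box,
                   color: Pixel = (255, 0, 0), thickness: int = 2) -> Frame:
    """Return a copy of `frame` with a rectangle outline drawn around `box`.

    The outline makes the region of interest visible in the image itself,
    so the model does not have to interpret numeric coordinates.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    left, top, right, bottom = box

    # Normalise the corner order and clamp the box to the frame bounds.
    left, right = max(0, min(left, right)), min(width - 1, max(left, right))
    top, bottom = max(0, min(top, bottom)), min(height - 1, max(top, bottom))

    # Copy each row so the original frame is left untouched.
    annotated = [list(row) for row in frame]

    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            # A pixel is on the outline if it lies within `thickness`
            # pixels of any edge of the box.
            on_border = (x - left < thickness or right - x < thickness or
                         y - top < thickness or bottom - y < thickness)
            if on_border:
                annotated[y][x] = color
    return annotated


if __name__ == "__main__":
    black = (0, 0, 0)
    frame = [[black for _ in range(8)] for _ in range(6)]
    marked = annotate_frame(frame, (2, 1, 6, 4), thickness=1)
    for row in marked:
        print("".join("#" if px != black else "." for px in row))
```

Drawing the box into the pixels trades a small loss of image content under the outline for a cue the model can perceive directly, which is the trade-off the abstract describes.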
