Substantial, Decomposable, and Invisible: Visual Context Misalignment in Instructional Videos for Physical Tasks
Yayuan Li, Chenglin Li, Jingying Wang, Filippos Bellos, Anhong Guo, Jason J. Corso

TL;DR
This study investigates how visual context mismatch in instructional videos affects task performance, revealing that misalignment significantly degrades quality and speed but remains unnoticed by users.
Contribution
The paper presents a systematic analysis of visual context attributes in instructional videos and demonstrates how their misalignment affects task success and user perception.
Findings
Fully aligned In-Context videos (ICON) yield 11.1% higher task completion quality and 15.5% faster completion than business-as-usual Internet videos (Study 1, N=16).
Four visual attributes significantly affect task performance when misaligned.
Users do not perceive the impact of single-attribute misalignment despite objective performance drops.
Abstract
Instructional videos are the dominant medium for learning physical tasks, yet they rarely match the user's real-world visual context. Motor simulation and cognitive load theories predict this mismatch should matter, but we do not know (1) how much it could affect task completion, (2) which visual attributes are responsible, and (3) how users experience it. We conduct two complementary studies (56 participants, 86+ hours, four first-aid and culinary tasks) in which we use Wizard-of-Oz recordings to control the degree of visual alignment in instructional videos. In Study 1 (N=16), we prepare In-Context instructional videos (ICON) -- fully aligned with the user's visual perception -- to compare against business-as-usual Internet videos. ICON yields statistically significant improvements: 11.1% higher completion quality and 15.5% faster completion. Qualitative analysis reveals four visual…
