From Videos to Conversations: Egocentric Instructions for Task Assistance
Lavisha Aggarwal, Vikas Bahirwani, and Andrea Colaco

TL;DR
This paper introduces a scalable method to convert instructional videos into multimodal conversations, creating a new dataset for training AI agents in task assistance, and provides baseline results for future research.
Contribution
The authors present an automatic pipeline to generate multimodal task-guidance conversations from videos, resulting in the HowToDIV dataset and initial benchmarks for multimodal procedural assistance.
Findings
Created the HowToDIV dataset with 507 conversations and 6,636 QA pairs.
Demonstrated baseline performance using Gemma 3 and Qwen 2.5 models.
Established initial benchmarks for multimodal task assistance.
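As a quick sanity check on the scale of the dataset, the two counts reported above imply an average conversation length. The following is a minimal sketch that uses only the figures stated in the findings; the function and variable names are illustrative and do not come from the paper.

```python
# Counts taken from the findings listed above.
NUM_CONVERSATIONS = 507
NUM_QA_PAIRS = 6636


def mean_qa_pairs_per_conversation(num_qa_pairs: int, num_conversations: int) -> float:
    """Return the average number of QA pairs per conversation."""
    if num_conversations <= 0:
        raise ValueError("num_conversations must be positive")
    return num_qa_pairs / num_conversations


avg_qa = mean_qa_pairs_per_conversation(NUM_QA_PAIRS, NUM_CONVERSATIONS)
print(f"{avg_qa:.1f} QA pairs per conversation on average")
```

So a typical conversation in HowToDIV contains roughly 13 question-answer exchanges, consistent with the multi-step procedures the dataset targets. This is a mean only; the listing gives no information about how lengths are distributed.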
Abstract
Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question-answer pairs, and 24 hours…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
