Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark
Lavisha Aggarwal, Vikas Bahirwani, Lin Li, Andrea Colaco

TL;DR
This paper introduces HowToDIV, a large-scale dataset of task-oriented dialogues generated from instructional videos, together with a new benchmark for AI agents that assist users in complex, multi-step real-world tasks.
Contribution
It presents a fully automatic method to generate task-guidance dialogues from instructional videos using large language models, and establishes a new dataset and benchmark for procedural task assistance.
Findings
Created the HowToDIV dataset, with 507 conversations and 6636 question-answer pairs.
Reported baseline performance using the Gemma-3 model.
Showed that automatic dialogue generation is effective for real-world tasks.
Abstract
Many everyday tasks, ranging from fixing appliances and cooking recipes to car maintenance, require expert knowledge, especially when they are complex and multi-step. Despite growing interest in AI agents, dialogue-video datasets grounded in real-world task assistance remain scarce. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into two-person task-guidance dialogues, aligned with fine-grained steps and video clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6636 question-answer pairs, and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting. Each session includes…
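
As a rough illustration of what "two-person dialogues aligned with fine-grained steps and video clips" could look like as data, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Step` and `Turn` classes, the `build_dialogue` function, and the example steps are hypothetical and are not taken from the paper or from HowToDIV. The fixed question templates stand in for the large language model that the authors use, so this shows only the alignment idea, not their generation method.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One fine-grained step of an instructional video (hypothetical schema)."""
    description: str
    start_sec: float
    end_sec: float


@dataclass
class Turn:
    """One dialogue turn, tied back to the step and clip it came from."""
    role: str                   # "user" or "assistant"
    text: str
    step_index: int
    clip: Tuple[float, float]   # (start_sec, end_sec) of the aligned clip


def build_dialogue(task: str, steps: List[Step]) -> List[Turn]:
    """Turn ordered steps into alternating user/assistant turns.

    Assumption: fixed templates replace the LLM call used in the paper.
    """
    turns: List[Turn] = []
    for i, step in enumerate(steps):
        clip = (step.start_sec, step.end_sec)
        if i == 0:
            question = f"I want to {task}. What should I do first?"
        else:
            question = "Done. What is the next step?"
        # The user asks, the assistant answers with the step description;
        # both turns carry the same step index and clip boundaries.
        turns.append(Turn("user", question, i, clip))
        turns.append(Turn("assistant", step.description, i, clip))
    return turns


# Made-up example steps, for illustration only.
steps = [
    Step("Loosen the lug nuts with a wrench.", 0.0, 12.5),
    Step("Raise the car with the jack.", 12.5, 40.0),
]
dialogue = build_dialogue("change a flat tire", steps)
for turn in dialogue:
    print(f"[step {turn.step_index}, clip {turn.clip}] {turn.role}: {turn.text}")
```

Keeping the step index and clip boundaries on every turn means each question-answer pair can be traced back to the video segment it describes, which is the kind of alignment the abstract refers to.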
