Proactive Assistant Dialogue Generation from Streaming Egocentric Videos
Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, Seungwhan Moon

TL;DR
This paper introduces a comprehensive framework for developing real-time proactive AI assistants that generate dialogue responses from streaming egocentric videos, addressing data collection and evaluation challenges.
Contribution
It presents a new data synthesis pipeline, automatic evaluation metrics validated by human studies, and an end-to-end model for real-time dialogue generation from streaming videos.
Findings
Created a large-scale synthetic dialogue dataset from egocentric videos
Validated automatic evaluation metrics with extensive human studies
Developed an end-to-end model for real-time dialogue generation from streaming videos
Abstract
Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Social Robot Interaction and HRI
