TL;DR
This paper introduces a benchmark and dataset for evaluating how well multi-modal LLMs provide real-time, interactive task guidance from live video, including mistake detection and correction.
Contribution
It presents Qualcomm Interactive Cooking, a novel benchmark and dataset, and introduces LiveMamba, a streaming multi-modal LLM for interactive instructional guidance.
Findings
State-of-the-art models struggle with real-time guidance and mistake detection.
The benchmark enables evaluation of models in live, situated coaching scenarios.
LiveMamba serves as a strong baseline for future research.
Abstract
Multi-modal Large Language Models (LLMs) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real time. This requires models that are not turn-based but can react asynchronously to a video stream, as well as video data showing users performing tasks, including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark feature densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely…
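To make the contrast between turn-based and asynchronous interaction concrete, here is a minimal Python sketch of a streaming guidance loop. It is not the paper's LiveMamba model or the benchmark's data format: `Frame`, `Message`, `stream_guidance`, and `toy_policy` are hypothetical names chosen for illustration, and the string-valued `observation` field stands in for real video frames. The point is only that the system inspects every incoming frame and decides on its own when to speak, rather than waiting for a user turn.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Frame:
    t: float          # timestamp in seconds (hypothetical field)
    observation: str  # stand-in for pixel data, an assumption for this sketch


@dataclass
class Message:
    t: float   # time at which the message is emitted
    kind: str  # "instruction", "success", or "mistake" (illustrative labels)
    text: str


def stream_guidance(
    frames: Iterable[Frame],
    policy: Callable[[Frame], Optional[Message]],
) -> Iterator[Message]:
    """Consume frames one by one and emit a message only when the policy
    decides to speak. No user prompt is needed to trigger a response."""
    for frame in frames:
        message = policy(frame)
        if message is not None:
            yield message


def toy_policy(frame: Frame) -> Optional[Message]:
    """Hand-written rule standing in for a learned streaming model."""
    if frame.observation == "adds salt instead of sugar":
        return Message(frame.t, "mistake", "That looks like salt; the recipe calls for sugar.")
    if frame.observation == "step done":
        return Message(frame.t, "success", "Step complete.")
    return None  # stay silent on uneventful frames


frames = [
    Frame(0.0, "idle"),
    Frame(1.0, "adds salt instead of sugar"),
    Frame(2.0, "idle"),
    Frame(3.0, "step done"),
]
messages = list(stream_guidance(frames, toy_policy))
```

In this toy run the loop stays silent on the idle frames and emits a timed mistake alert and a timed success message, which mirrors the kind of timed feedback the abstract describes without claiming to reproduce the dataset's annotations.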