Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

Apratim Bhattacharyya; Bicheng Xu; Sanjay Haresh; Reza Pourreza; Litian Liu; Sunny Panchal; Pulkit Madan; Leonid Sigal; Roland Memisevic

arXiv:2511.21998 · cs.CV · April 14, 2026

TL;DR

This paper introduces a new benchmark and dataset for evaluating whether multi-modal LLMs can provide real-time, interactive task guidance from live video, including detecting and correcting user mistakes.

Contribution

The paper presents Qualcomm Interactive Cooking, a novel benchmark and dataset, and introduces LiveMamba, a streaming multi-modal LLM for interactive instructional guidance.

Findings

01 State-of-the-art models struggle with real-time guidance and mistake detection.

02 The benchmark enables evaluation of models in live, situated coaching scenarios.

03 LiveMamba serves as a strong baseline for future research.

Abstract

Multi-modal Large Language Models (LLMs) have advanced conversational abilities but struggle to provide live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution and identifying and alerting users to mistakes, all of which must happen in real time. This requires models that are not turn-based but can react asynchronously to a video stream, as well as video data showing users performing tasks, including their mistakes and corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark feature densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely…

Videos

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?· slideslive