User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance
Mrinal Verghese, Brian Chen, Hamid Eghbalzadeh, Tushar Nagarajan, Ruta Desai

TL;DR
This study evaluates multimodal large language models for activity assistance, focusing on visual grounding, activity forecasting, and user-in-the-loop replanning, through offline benchmarks and a first-of-its-kind user study.
Contribution
It introduces a comprehensive evaluation framework combining offline benchmarks and the first user-in-the-loop study for multimodal LLMs in activity assistance.
Findings
Socratic Models outperform VCLMs in offline and online tasks.
Grounding long visual history remains challenging for current models.
Offline metrics do not reliably predict online performance.
Abstract
Our research investigates the capability of modern multimodal reasoning models, powered by Large Language Models (LLMs), to facilitate vision-powered assistants for multi-step daily activities. Such assistants must be able to 1) encode relevant visual history from the assistant's sensors, e.g., a camera, 2) forecast future actions for accomplishing the activity, and 3) replan based on the user in the loop. To evaluate the first two capabilities, grounding visual history and forecasting in short and long horizons, we benchmark two prominent classes of multimodal LLM approaches, Socratic Models and Vision Conditioned Language Models (VCLMs), on video-based action anticipation tasks using offline datasets. These offline benchmarks, however, do not allow us to close the loop with the user, which is essential to evaluate the replanning capabilities and measure successful…
Taxonomy
Topics: Context-Aware Activity Recognition Systems
Methods: Adaptive Richard's Curve Weighted Activation
