From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs
Federico Toschi, Nicolò Brunello, Andrea Sassella, Vincenzo Scotti, Mark James Carman

TL;DR
This paper introduces a dataset linking instruction manuals with assembly videos to evaluate multimodal large language models' ability to assist in technical tasks, focusing on reasoning, step tracking, and manual referencing.
Contribution
It presents the Manual to Action Dataset (M2AD) for assessing MLMs in procedural understanding and introduces evaluation benchmarks for reasoning and referencing capabilities.
Findings
Some MLMs understand procedural sequences but are limited by architecture.
Performance is constrained by hardware and architectural factors.
Multi-image and interleaved text-image reasoning are needed for improvement.
Abstract
The recent advancements introduced by Large Language Models (LLMs) have transformed how Artificial Intelligence (AI) can support complex, real-world tasks, pushing research beyond the boundaries of text towards multimodal contexts and leading to Multimodal Large Language Models (MLMs). Given the current adoption of LLM-based assistants in solving technical or domain-specific problems, the natural continuation of this trend is to extend the input domains of these assistants by exploiting MLMs. Ideally, these MLMs should be used as real-time assistants in procedural tasks, integrating a view of the environment in which the assisted user is located or, even better, sharing the user's point of view via Virtual Reality (VR) or Augmented Reality (AR) supports, so as to reason over the same scenario the user is experiencing. With this work, we aim at evaluating the quality of currently openly…
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Topic Modeling
