Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong, He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong, Huang

TL;DR
Mementos introduces a new benchmark with 4,761 image sequences to evaluate multimodal large language models' ability to reason over dynamic visual information, revealing current models' struggles with sequential image understanding.
Contribution
This paper presents Mementos, the first comprehensive benchmark for assessing MLLMs' reasoning over image sequences, highlighting their limitations and influencing factors.
Findings
MLLMs often hallucinate or misrepresent objects and behaviors in sequences.
GPT-4V and Gemini show significant challenges in dynamic scene understanding.
Key factors include object-behavior correlation and co-occurring behaviors.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
MethodsAttention Is All You Need · Absolute Position Encodings · Label Smoothing · Layer Normalization · Adam · Residual Connection · Dropout · Linear Layer · Multi-Head Attention · Byte Pair Encoding
