Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

Xiyao Wang; Yuhang Zhou; Xiaoyu Liu; Hongjin Lu; Yuancheng Xu; Feihong He; Jaehong Yoon; Taixi Lu; Gedas Bertasius; Mohit Bansal; Huaxiu Yao; Furong Huang

arXiv:2401.10529 · cs.CV · January 26, 2024 · 1 citation

TL;DR

Mementos introduces a new benchmark with 4,761 image sequences to evaluate multimodal large language models' ability to reason over dynamic visual information, revealing current models' struggles with sequential image understanding.

Contribution

This paper presents Mementos, the first comprehensive benchmark for assessing MLLMs' reasoning over image sequences, and highlights the models' limitations and the factors that influence their performance.

Findings

01. MLLMs often hallucinate or misrepresent objects and behaviors in image sequences.

02. GPT-4V and Gemini both struggle with dynamic scene understanding.

03. Key factors behind these failures include the correlation between object and behavior hallucinations and the influence of co-occurring behaviors.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often…
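The abstract mentions a GPT-4 assisted evaluation but does not spell out the scoring here. As a minimal sketch only, assume the evaluation reduces each description to a list of object or behavior keywords and compares it against a reference keyword list with set-based precision, recall, and F1. The function name `keyword_scores`, the normalization, and the example keywords are all hypothetical illustrations, not the authors' implementation, and the keyword-extraction step itself (the part GPT-4 would assist with) is not shown.

```python
from typing import Iterable


def keyword_scores(predicted: Iterable[str], reference: Iterable[str]) -> dict:
    """Set-based precision, recall, and F1 between two keyword lists.

    Assumption: keywords were already extracted from the model's description
    (predicted) and from a reference annotation (reference).
    """
    # Normalize case and whitespace, and drop empty entries.
    pred = {k.strip().lower() for k in predicted if k.strip()}
    ref = {k.strip().lower() for k in reference if k.strip()}

    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical keywords for one image sequence (not taken from the benchmark).
reference_behaviors = ["pick up", "throw", "catch"]
predicted_behaviors = ["pick up", "throw", "kick", "run"]

scores = keyword_scores(predicted_behaviors, reference_behaviors)
print(scores)
```

In this sketch, predicted keywords missing from the reference list ("kick", "run") lower precision and can be read as hallucinated behaviors, while missed reference keywords ("catch") lower recall. Exact string matching is a simplification; a real pipeline would likely need synonym handling.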

Code & Models

Repositories

umd-huang-lab/mementos
Official

Datasets

furonghuang-lab/Mementos
dataset · 48 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Layer Normalization · Adam · Residual Connection · Dropout · Linear Layer · Multi-Head Attention · Byte Pair Encoding