ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Rohan Wadhawan; Hritik Bansal; Kai-Wei Chang; Nanyun Peng

arXiv:2401.13311 · cs.CV · July 30, 2024 · 1 citation

TL;DR

ConTextual introduces a new dataset for evaluating context-sensitive, text-rich visual reasoning in multimodal models, revealing a significant gap between model and human performance, particularly on time-related data and infographics.

Contribution

The paper presents ConTextual, a novel dataset for benchmarking multimodal models on context-sensitive reasoning with text-rich images, and provides comprehensive evaluation and analysis of current models' capabilities.

Findings

01. GPT-4V outperforms other models but lags 30.8% behind human performance.

02. Models struggle with time-related data and infographics.

03. Models perform well on abstract visual contexts like memes and quotes.

Abstract

Many real-world tasks require an agent to reason jointly over text and visual objects (e.g., navigating in public spaces), which we refer to as context-sensitive text-rich visual reasoning. Specifically, these tasks require an understanding of the context in which the text interacts with visual elements within an image. However, there is a lack of existing datasets to benchmark the state-of-the-art multimodal models' capability on context-sensitive text-rich visual reasoning. In this paper, we introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. We conduct experiments to assess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision, LLaVA-Next) and establish a human performance baseline. Further, we perform human evaluations of the model responses and observe a significant performance gap…

Code & Models

Repositories

rohan598/contextual (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Residual Connection · Dropout · Byte Pair Encoding · Adam · Label Smoothing · Linear Layer · Multi-Head Attention · Softmax · Dense Connections