ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng

TL;DR
ConTextual is a new dataset for evaluating context-sensitive, text-rich visual reasoning in multimodal models. Evaluation on it reveals significant performance gaps and difficulty with complex visual and time-related information.
Contribution
The paper presents ConTextual, a novel dataset for benchmarking multimodal models on context-sensitive reasoning with text-rich images, and provides comprehensive evaluation and analysis of current models' capabilities.
Findings
GPT-4V outperforms other models but lags 30.8% behind human performance.
GPT-4V struggles with time-related data and infographics.
GPT-4V performs well on abstract visual contexts such as memes and quotes.
Abstract
Many real-world tasks require an agent to reason jointly over text and visual objects (e.g., navigating in public spaces), which we refer to as context-sensitive text-rich visual reasoning. Specifically, these tasks require an understanding of the context in which the text interacts with visual elements within an image. However, there is a lack of existing datasets to benchmark the state-of-the-art multimodal models' capability on context-sensitive text-rich visual reasoning. In this paper, we introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. We conduct experiments to assess the performance of 14 foundation models (e.g., GPT-4V, Gemini-Pro-Vision, LLaVA-Next) and establish a human performance baseline. Further, we perform human evaluations of the model responses and observe a significant performance gap…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Residual Connection · Dropout · Byte Pair Encoding · Adam · Label Smoothing · Linear Layer · Multi-Head Attention · Softmax · Dense Connections
