Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Zhecan Wang; Haoxuan You; Yicheng He; Wenhao Li; Kai-Wei Chang; Shih-Fu Chang

arXiv:2211.05895 · cs.CV · October 24, 2023

Open Access

TL;DR

This paper introduces a multimodal evaluation pipeline for visual commonsense understanding, revealing insights into model comprehension and demonstrating that training with generated data improves performance on standard benchmarks.

Contribution

It presents a novel automatic evaluation method for visual commonsense, and shows that training with generated data enhances model performance and understanding.

Findings

01. Visual information is underutilized compared to text.

02. Low-level semantic information aids high-level understanding.

03. Training with ME data improves VCR benchmark performance.

Abstract

Visual commonsense understanding requires Vision Language (VL) models to not only understand image and text but also cross-reference in-between to fully integrate and achieve comprehension of the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, it is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge. We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Test