Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense
Zhecan Wang, Haoxuan You, Yicheng He, Wenhao Li, Kai-Wei Chang, and Shih-Fu Chang

TL;DR
This paper introduces a multimodal evaluation pipeline for visual commonsense understanding, revealing insights into model comprehension and demonstrating that training with generated data improves performance on standard benchmarks.
Contribution
It presents a novel automatic evaluation method for visual commonsense, and shows that training with generated data enhances model performance and understanding.
Findings
Visual information is underutilized compared to text.
Low-level semantic information aids high-level understanding.
Training with ME data improves VCR benchmark performance.
Abstract
Visual commonsense understanding requires Vision Language (VL) models to not only understand image and text but also cross-reference in-between to fully integrate and achieve comprehension of the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, it is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge. We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Test
