FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo

TL;DR
This paper introduces FineMatch, a benchmark for aspect-based fine-grained image-text mismatch detection and correction, along with an evaluation metric, to improve multimodal models' compositional understanding.
Contribution
The paper proposes a new benchmark and metric for fine-grained image-text mismatch detection and correction, enhancing evaluation of vision-language models' compositionality.
Findings
Models trained on FineMatch show improved mismatch detection.
Strong in-context learning models are less effective at fine-grained analysis.
FineMatch enables systems for hallucination detection in text-to-image generation.
Abstract
Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite their impressive ability to perform complex reasoning, current VLMs often struggle to capture compositional information precisely on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark focused on text and image mismatch detection and correction. The benchmark introduces a novel task for boosting and evaluating VLMs' compositionality in aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain…
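The three outputs the task asks for (the mismatched phrase, its aspect class, and a proposed correction) can be pictured as a small record per mismatch. The sketch below is only an illustration under stated assumptions: the names AspectMismatch and apply_corrections, the example caption, and the class labels "attribute" and "number" are hypothetical and are not taken from the paper or its dataset format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AspectMismatch:
    """One predicted mismatch (hypothetical structure, not the paper's schema)."""
    phrase: str        # mismatched span exactly as it appears in the caption
    aspect_class: str  # illustrative label; the benchmark's label set is not listed here
    correction: str    # proposed replacement phrase


def apply_corrections(caption: str, mismatches: List[AspectMismatch]) -> str:
    """Return the caption with each mismatched phrase replaced by its correction."""
    corrected = caption
    for m in mismatches:
        if m.phrase not in corrected:
            raise ValueError(f"phrase not found in caption: {m.phrase!r}")
        # Replace only the first occurrence so repeated words elsewhere are untouched.
        corrected = corrected.replace(m.phrase, m.correction, 1)
    return corrected


# Made-up example: the caption and predictions are invented for illustration.
caption = "A red car parked next to two dogs."
predictions = [
    AspectMismatch("red car", "attribute", "blue car"),
    AspectMismatch("two dogs", "number", "three dogs"),
]
print(apply_corrections(caption, predictions))
```

Keeping the phrase, class, and correction together in one record makes it straightforward to score detection, classification, and correction separately, though how the benchmark actually scores them is not described in the text above.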
Taxonomy
Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
