FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Hang Hua; Jing Shi; Kushal Kafle; Simon Jenni; Daoan Zhang; John Collomosse; Scott Cohen; Jiebo Luo

arXiv:2404.14715 · cs.CV · July 23, 2024

Open Access

TL;DR

This paper introduces FineMatch, a benchmark for aspect-based fine-grained image-text mismatch detection and correction, along with an evaluation metric, to improve multimodal models' compositional understanding.

Contribution

The paper proposes a new benchmark and metric for fine-grained image-text mismatch detection and correction, enhancing evaluation of vision-language models' compositionality.

Findings

1. Models trained on FineMatch show improved detection of fine-grained image-text mismatches.
2. Models with strong in-context learning abilities are comparatively less effective at fine-grained image-text matching analysis.
3. FineMatch can serve as the basis for systems that detect hallucinations in text-to-image generation.

Abstract

Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Although VLMs show an impressive ability to perform complex reasoning, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain…
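As a rough illustration of the task output described in the abstract (a mismatched aspect phrase, its aspect class, and a proposed correction), the following Python sketch shows one possible record layout and a simple exact-match score. The `Mismatch` schema, the aspect labels, the example phrases, and the `detection_scores` function are all hypothetical; they are not the paper's data format or its proposed evaluation metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    """One predicted or annotated mismatch in a caption (hypothetical schema)."""
    phrase: str      # aspect phrase in the caption that disagrees with the image
    aspect: str      # aspect class label; the labels used below are illustrative
    correction: str  # proposed replacement phrase


def detection_scores(predicted, reference):
    """Precision and recall over (phrase, aspect) pairs, using exact match.

    A simple stand-in for illustration, not the metric proposed in the paper.
    Corrections are ignored here; only detection and classification count.
    """
    pred = {(m.phrase.lower(), m.aspect.lower()) for m in predicted}
    ref = {(m.phrase.lower(), m.aspect.lower()) for m in reference}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Made-up annotations and predictions for a single image-caption pair.
reference = [
    Mismatch("red car", "attribute", "blue car"),
    Mismatch("two dogs", "number", "three dogs"),
]
predicted = [
    Mismatch("red car", "attribute", "green car"),
    Mismatch("on the table", "relation", "under the table"),
]

precision, recall = detection_scores(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact string matching is deliberately strict; a softer overlap measure between phrases would likely be more forgiving of paraphrased predictions, and scoring the corrections would need a separate text-similarity step.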

Taxonomy

Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques