Fine-Grained Visual Entailment

Christopher Thomas; Yipeng Zhang; Shih-Fu Chang

arXiv:2203.15704·cs.CV·March 30, 2022

TL;DR

This paper introduces a fine-grained visual entailment task: predicting the logical relationship of each knowledge element in a piece of text to an image. It proposes a novel, explainable multi-instance learning approach with semantic structural constraints, which achieves 68.18% accuracy.

Contribution

The paper proposes the first explainable framework for fine-grained visual entailment, together with a new multi-instance learning method that is trained with only sample-level supervision.

Findings

01 Achieved 68.18% accuracy on a new dataset of manually annotated knowledge elements.

02 Outperformed several strong baseline models.

03 Provided extensive qualitative analysis of predictions.

Abstract

Visual entailment is a recently proposed multimodal reasoning task where the goal is to predict the logical relationship of a piece of text to an image. In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image. Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity. Because we lack fine-grained labels to train our method, we propose a novel multi-instance learning approach which learns a fine-grained labeling using only sample-level supervision. We also impose novel semantic structural constraints which ensure that fine-grained predictions are internally semantically consistent. We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18%…
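To make the multi-instance idea in the abstract concrete, here is a minimal sketch of how predictions for individual knowledge elements could be combined into a single sample-level prediction, so that a sample-level label alone can supervise them. The aggregation rule is an assumption made for illustration and is not taken from the paper: a sample is treated as contradicted if any element is contradicted, entailed only if every element is entailed, and neutral otherwise. The soft version additionally assumes that elements are independent. The names `aggregate_sample_label` and `aggregate_sample_probs` are hypothetical and do not come from the authors' repository.

```python
import math
from typing import List, Tuple

LABELS = ("entailment", "neutral", "contradiction")


def aggregate_sample_label(element_labels: List[str]) -> str:
    """Combine hard per-element labels into one sample-level label.

    Assumed rule (illustrative only): any contradiction wins, then any
    neutral, and the sample is entailed only if every element is entailed.
    """
    if not element_labels:
        raise ValueError("need at least one knowledge element")
    for label in element_labels:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
    if "contradiction" in element_labels:
        return "contradiction"
    if "neutral" in element_labels:
        return "neutral"
    return "entailment"


def aggregate_sample_probs(
    element_probs: List[Tuple[float, float, float]]
) -> Tuple[float, float, float]:
    """Soft version of the same rule.

    Each tuple is (p_entailment, p_neutral, p_contradiction) for one
    knowledge element. Elements are assumed independent, which is a
    simplification made for this sketch.
    """
    if not element_probs:
        raise ValueError("need at least one knowledge element")
    # Probability that every element is entailed.
    p_all_entailed = math.prod(p[0] for p in element_probs)
    # Probability that no element is contradicted.
    p_no_contradiction = math.prod(1.0 - p[2] for p in element_probs)
    p_contradiction = 1.0 - p_no_contradiction
    # Remaining mass: no contradiction, but at least one element not entailed.
    p_neutral = p_no_contradiction - p_all_entailed
    return (p_all_entailed, p_neutral, p_contradiction)
```

Under these assumptions, the sample-level probabilities are differentiable functions of the element-level ones, so a standard classification loss on the sample label could in principle propagate back to the fine-grained predictions.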

Code & Models

Repositories

skrighyz/fgve (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques