Fine-Grained Visual Entailment
Christopher Thomas, Yipeng Zhang, Shih-Fu Chang

TL;DR
This paper introduces a fine-grained visual entailment task: predicting the logical relationship of each knowledge element in a piece of text to an image. It proposes an explainable multi-instance learning approach with semantic structural constraints, which achieves 68.18% accuracy.
Contribution
It proposes the first fine-grained visual entailment framework with explainability and a new multi-instance learning method that operates with only sample-level supervision.
Findings
Achieved 68.18% accuracy on the new dataset.
Outperformed several strong baseline models.
Provided extensive qualitative analysis of predictions.
Abstract
Visual entailment is a recently proposed multimodal reasoning task where the goal is to predict the logical relationship of a piece of text to an image. In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image. Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity. Because we lack fine-grained labels to train our method, we propose a novel multi-instance learning approach which learns a fine-grained labeling using only sample-level supervision. We also impose novel semantic structural constraints which ensure that fine-grained predictions are internally semantically consistent. We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18%…
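To make the idea of learning from sample-level supervision more concrete, the sketch below shows one plausible way that predictions for individual knowledge elements could be combined into a single sample-level prediction. The aggregation rule is an assumption made for illustration, not the rule stated in the paper: a sample is treated as a contradiction if any element contradicts the image, as entailment only if every element is entailed, and as neutral otherwise. The function names `aggregate_hard` and `aggregate_soft` and the label strings are hypothetical.

```python
import math
from typing import Dict, Sequence

# Hypothetical label names; the paper's own label set may be named differently.
LABELS = ("entailment", "neutral", "contradiction")


def aggregate_hard(element_labels: Sequence[str]) -> str:
    """Combine discrete knowledge-element labels into one sample-level label.

    Assumed rule (illustrative only):
      any contradiction -> contradiction
      else any neutral  -> neutral
      else              -> entailment
    """
    if not element_labels:
        raise ValueError("need at least one knowledge element")
    for label in element_labels:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
    if "contradiction" in element_labels:
        return "contradiction"
    if "neutral" in element_labels:
        return "neutral"
    return "entailment"


def aggregate_soft(element_probs: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Soft version of the same assumed rule.

    Each element supplies a probability distribution over LABELS.
    Elements are treated as independent, which is a simplifying assumption.
    """
    if not element_probs:
        raise ValueError("need at least one knowledge element")
    # The sample is entailed only if every element is entailed.
    p_ent = math.prod(p["entailment"] for p in element_probs)
    # The sample is contradicted if at least one element is contradicted.
    p_con = 1.0 - math.prod(1.0 - p["contradiction"] for p in element_probs)
    # The remaining probability mass is neutral; clamp to guard against rounding.
    p_neu = max(0.0, 1.0 - p_ent - p_con)
    return {"entailment": p_ent, "neutral": p_neu, "contradiction": p_con}


if __name__ == "__main__":
    print(aggregate_hard(["entailment", "neutral", "entailment"]))
    print(aggregate_soft([
        {"entailment": 0.9, "neutral": 0.05, "contradiction": 0.05},
        {"entailment": 0.8, "neutral": 0.1, "contradiction": 0.1},
    ]))
```

Under a rule of this kind, a loss computed only on the sample-level output can still push gradients back to each element-level prediction, which is the general sense in which multi-instance learning can work without fine-grained labels. The soft variant is shown because a product-based aggregation is differentiable, whereas the hard rule is not.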
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
