Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues

Qingxiu Dong; Ziwei Qin; Heming Xia; Tian Feng; Shoujie Tong; Haoran Meng; Lin Xu; Weidong Zhan; Sujian Li; Zhongyu Wei; Tianyu Liu; Zuifang Sui

arXiv:2105.07122·cs.CL·March 18, 2022

TL;DR

This paper introduces Premise-based Multi-modal Reasoning (PMR), a new dataset and task for reasoning with a textual premise and visual clues, challenging existing models to infer hypotheses from combined textual and visual information.

Contribution

The paper proposes the PMR task and dataset, incorporating premise-based reasoning with high-quality annotations and adversarial samples, advancing multimodal inference research.

Findings

01  State-of-the-art models show room for improvement on PMR.

02  PMR dataset contains 15,360 samples with high-quality annotations.

03  Adversarial samples help reduce annotation artifacts.

Abstract

It is common practice for recent work in vision-language cross-modal reasoning to adopt a binary or multi-choice classification formulation that takes as input a set of source image(s) and a textual query. In this work, we take a sober look at such an unconditional formulation, in the sense that no prior knowledge is specified with respect to the source image(s). Inspired by the designs of both visual commonsense reasoning and natural language inference tasks, we propose a new task termed Premise-based Multi-modal Reasoning (PMR), where a textual premise is the background presumption on each source image. The PMR dataset contains 15,360 manually annotated samples, which are created by a multi-phase crowd-sourcing process. With selected high-quality movie screenshots and human-curated premise templates from 6 pre-defined categories, we ask crowd-source workers to write one true hypothesis and…
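The premise-conditioned, multi-choice formulation described in the abstract can be pictured with a small sketch. Everything here is an assumption made for illustration: the `PremiseSample` record, its field names, the `accuracy` helper, and the `first_choice_baseline` chooser are hypothetical and are not taken from the paper or its released data format. The number of candidate hypotheses per sample is deliberately left open.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class PremiseSample:
    """Hypothetical record for one premise-conditioned multi-choice item."""
    image_path: str               # source image (e.g. a movie screenshot)
    premise: str                  # textual background presumption on the image
    hypotheses: Tuple[str, ...]   # candidate hypotheses to choose from
    label: int                    # index of the hypothesis treated as true

    def __post_init__(self) -> None:
        # Reject records whose label does not point at a candidate.
        if not 0 <= self.label < len(self.hypotheses):
            raise ValueError("label must index into hypotheses")


# A chooser maps a sample to the index of the hypothesis it selects.
Chooser = Callable[[PremiseSample], int]


def accuracy(samples: Sequence[PremiseSample], choose: Chooser) -> float:
    """Fraction of samples where the chooser picks the labeled hypothesis."""
    if not samples:
        raise ValueError("need at least one sample")
    correct = sum(1 for s in samples if choose(s) == s.label)
    return correct / len(samples)


def first_choice_baseline(sample: PremiseSample) -> int:
    """Trivial baseline that ignores premise and image and picks index 0."""
    return 0
```

A real model would replace `first_choice_baseline` with a function that scores each hypothesis against both the premise and the image. The trivial chooser is included only to show the interface such a model would need to satisfy.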

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Approximate Bayesian Computation