Multimodal Logical Inference System for Visual-Textual Entailment

Riko Suzuki; Hitomi Yanaka; Masashi Yoshikawa; Koji Mineshima; Daisuke Bekki

arXiv:1906.03952 · cs.CL · June 11, 2019 · 1 cites

Open Access

TL;DR

This paper introduces an unsupervised multimodal logical inference system that uses logic-based representations to determine entailment between visual and textual data, effectively handling complex semantic structures.

Contribution

It presents a novel approach combining semantic parsing and theorem proving for multimodal inference, unifying text and image understanding through logic-based representations.

Findings

01 Effective proof of entailment relations between images and text
02 Handles semantically complex sentences in visual-textual inference
03 Unsupervised approach reduces reliance on labeled data

Abstract

A large amount of research about multimodal inference across text and vision has been recently developed to obtain visually grounded word and sentence representations. In this paper, we use logic-based representations as unified meaning representations for texts and images and present an unsupervised multimodal logical inference system that can effectively prove entailment relations between them. We show that by combining semantic parsing and theorem proving, the system can handle semantically complex sentences for visual-textual inference.
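To make the idea of a unified logic-based representation concrete, here is a minimal sketch, not the authors' implementation. It assumes an image has already been described as a finite first-order structure (entities, attributes, relations) and that a sentence has already been parsed into a first-order formula. All names (`image`, `holds`, the tuple encoding of formulas) are hypothetical. The sketch also swaps theorem proving for simple model checking over that finite structure, which treats anything not recorded in the image description as false.

```python
# Hypothetical image description as a finite first-order structure.
image = {
    "entities": {"e1", "e2"},
    "unary": {("cat", "e1"), ("white", "e1"), ("sofa", "e2")},
    "binary": {("on", "e1", "e2")},
}


def holds(formula, model, env=None):
    """Evaluate a formula (nested tuples) against a finite structure."""
    env = env or {}
    op = formula[0]
    if op == "pred":
        _, name, *args = formula
        values = tuple(env[a] for a in args)  # look up bound variables
        if len(values) == 1:
            return (name, values[0]) in model["unary"]
        return (name, *values) in model["binary"]
    if op == "and":
        return all(holds(f, model, env) for f in formula[1:])
    if op == "or":
        return any(holds(f, model, env) for f in formula[1:])
    if op == "not":
        return not holds(formula[1], model, env)
    if op in ("exists", "forall"):
        _, var, body = formula
        results = (holds(body, model, {**env, var: e}) for e in model["entities"])
        return any(results) if op == "exists" else all(results)
    raise ValueError(f"unknown operator: {op}")


# "A white cat is on a sofa."
white_cat_on_sofa = (
    "exists", "x",
    ("and", ("pred", "cat", "x"), ("pred", "white", "x"),
     ("exists", "y", ("and", ("pred", "sofa", "y"), ("pred", "on", "x", "y")))),
)

# "No cat is on a sofa."
no_cat_on_sofa = (
    "not",
    ("exists", "x",
     ("and", ("pred", "cat", "x"),
      ("exists", "y", ("and", ("pred", "sofa", "y"), ("pred", "on", "x", "y"))))),
)

# "Every cat is white." (implication written as: not cat(x) or white(x))
every_cat_white = (
    "forall", "x",
    ("or", ("not", ("pred", "cat", "x")), ("pred", "white", "x")),
)

for label, formula in [
    ("A white cat is on a sofa", white_cat_on_sofa),
    ("No cat is on a sofa", no_cat_on_sofa),
    ("Every cat is white", every_cat_white),
]:
    print(label, "->", holds(formula, image))
```

Negation and quantifiers are the kind of semantically complex constructions the abstract refers to. A full system along the lines described would obtain the formulas from a semantic parser and pass them to a theorem prover rather than evaluating them directly as done here.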

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling