Multimodal Logical Inference System for Visual-Textual Entailment
Riko Suzuki, Hitomi Yanaka, Masashi Yoshikawa, Koji Mineshima, Daisuke Bekki

TL;DR
This paper introduces an unsupervised multimodal logical inference system that uses logic-based representations as a shared meaning representation for images and text and proves entailment relations between them, including for semantically complex sentences.
Contribution
It presents a novel approach combining semantic parsing and theorem proving for multimodal inference, unifying text and image understanding through logic-based representations.
Findings
Effective proof of entailment relations between images and text
Handles semantically complex sentences in visual-textual inference
Unsupervised approach reduces reliance on labeled data
Abstract
A large amount of research about multimodal inference across text and vision has been recently developed to obtain visually grounded word and sentence representations. In this paper, we use logic-based representations as unified meaning representations for texts and images and present an unsupervised multimodal logical inference system that can effectively prove entailment relations between them. We show that by combining semantic parsing and theorem proving, the system can handle semantically complex sentences for visual-textual inference.
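To make the idea of a unified logic-based representation concrete, here is a minimal sketch, not the authors' system. It assumes an image has already been reduced to a set of entities and atomic facts (the `ENTITIES` and `FACTS` values below are invented for illustration) and that a sentence has already been parsed into a first-order formula, written here as nested tuples. In place of the theorem prover the abstract mentions, the sketch substitutes a simple finite model checker under a closed-world assumption; the function name `holds` and the tuple encoding are hypothetical.

```python
# Hypothetical toy "image": entities and the atomic facts observed in it.
ENTITIES = {"e1", "e2", "e3"}
FACTS = {
    ("cat", "e1"),
    ("cat", "e2"),
    ("sofa", "e3"),
    ("white", "e1"),
    ("on", "e1", "e3"),
    ("on", "e2", "e3"),
}


def holds(formula, env):
    """Evaluate a first-order formula against FACTS.

    Closed-world assumption: an atom that is not listed in FACTS is false.
    `env` maps variable names to entities.
    """
    op = formula[0]
    if op == "atom":
        _, pred, *args = formula
        return (pred, *(env[a] for a in args)) in FACTS
    if op == "not":
        return not holds(formula[1], env)
    if op == "and":
        return all(holds(f, env) for f in formula[1:])
    if op == "imp":
        return (not holds(formula[1], env)) or holds(formula[2], env)
    if op == "exists":
        _, var, body = formula
        return any(holds(body, {**env, var: e}) for e in ENTITIES)
    if op == "forall":
        _, var, body = formula
        return all(holds(body, {**env, var: e}) for e in ENTITIES)
    raise ValueError(f"unknown operator: {op}")


# "Every cat is on a sofa."
every_cat_on_sofa = (
    "forall", "x",
    ("imp",
     ("atom", "cat", "x"),
     ("exists", "y", ("and", ("atom", "sofa", "y"), ("atom", "on", "x", "y")))),
)

# "No cat is white."
no_cat_is_white = (
    "not",
    ("exists", "x", ("and", ("atom", "cat", "x"), ("atom", "white", "x"))),
)

print(holds(every_cat_on_sofa, {}))
print(holds(no_cat_is_white, {}))
```

The two example sentences use a universal quantifier and a negation, the kind of semantically complex constructions the abstract refers to. A real system would need a semantic parser to produce the formulas and a prover that can use background knowledge, neither of which this sketch attempts.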
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
