Analogical Reasoning for Visually Grounded Language Acquisition

Bo Wu; Haoyu Qin; Alireza Zareian; Carl Vondrick; Shih-Fu Chang

arXiv:2007.11668 · cs.CL · July 24, 2020 · 5 cites

TL;DR

This paper introduces ARTNet, a multimodal transformer with analogical reasoning for visually grounded language acquisition, enabling AI to generalize and recognize novel compositions in videos more effectively.

Contribution

We propose a novel analogical reasoning mechanism integrated into a transformer model for improved language grounding in visual data.

Findings

01. ARTNet outperforms state-of-the-art models in recognition accuracy.

02. The model demonstrates strong generalization to unseen compositions.

03. Extensive experiments validate the effectiveness of the analogical reasoning approach.

Abstract

Children acquire language subconsciously by observing the surrounding world and listening to descriptions. They can discover the meaning of words even without explicit language knowledge, and generalize to novel compositions effortlessly. In this paper, we bring this ability to AI, by studying the task of Visually grounded Language Acquisition (VLA). We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning, which approximates novel compositions by learning semantic mapping and reasoning operations from previously seen compositions. Our proposed method, Analogical Reasoning Transformer Networks (ARTNet), is trained on raw multimedia data (video frames and transcripts), and after observing a set of compositions such as "washing apple" or "cutting carrot", it can generalize and recognize new compositions in new video frames, such as "washing…
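The idea of approximating a novel composition from previously seen ones can be illustrated with a toy sketch. This is not ARTNet: the abstract describes mapping and reasoning operations learned inside a multimodal transformer, whereas the sketch below swaps in fixed vector-offset arithmetic over hand-written embeddings. The vocabulary, the vector values, and the helper `approximate` are all hypothetical and chosen only for illustration.

```python
# Toy vector-offset analogy over composition embeddings.
# Assumption: every value and name here is illustrative, not from the paper.

# Hypothetical embeddings for (verb, noun) compositions already observed.
SEEN = {
    ("peeling", "potato"): [0.9, 0.1, 0.8, 0.1],
    ("slicing", "potato"): [0.1, 0.9, 0.8, 0.1],
    ("slicing", "onion"): [0.1, 0.9, 0.1, 0.8],
}


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def approximate(verb, noun, seen):
    """Estimate the embedding of an unseen (verb, noun) pair.

    Uses the analogy (verb, noun) ~ (verb, n2) - (v2, n2) + (v2, noun),
    where all three right-hand compositions must have been seen.
    Returns None when no such triple of seen compositions exists.
    """
    for (v2, n2) in seen:
        if v2 == verb or n2 == noun:
            continue
        if (verb, n2) in seen and (v2, noun) in seen:
            offset = sub(seen[(verb, n2)], seen[(v2, n2)])
            return add(offset, seen[(v2, noun)])
    return None


if __name__ == "__main__":
    # "peeling onion" was never observed; estimate it from seen compositions.
    print(approximate("peeling", "onion", SEEN))
    # No seen composition contains "frying", so no analogy can be formed.
    print(approximate("frying", "onion", SEEN))
```

The sketch only reuses verb and noun contrasts that already appear in the seen set. A learned model would instead have to infer which seen compositions are relevant and how to combine them, which is the part the abstract attributes to the analogical reasoning mechanism.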

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Residual Connection · Adam · Multi-Head Attention · Dropout · Softmax · Label Smoothing