Analogical Reasoning for Visually Grounded Language Acquisition
Bo Wu, Haoyu Qin, Alireza Zareian, Carl Vondrick, Shih-Fu Chang

TL;DR
This paper introduces ARTNet, a multimodal transformer with analogical reasoning for visually grounded language acquisition, enabling AI to generalize and recognize novel compositions in videos more effectively.
Contribution
We propose a novel analogical reasoning mechanism integrated into a transformer model for improved language grounding in visual data.
Findings
ARTNet outperforms state-of-the-art models in recognition accuracy
The model demonstrates strong generalization to unseen compositions
Extensive experiments validate the effectiveness of the analogical reasoning approach
Abstract
Children acquire language subconsciously by observing the surrounding world and listening to descriptions. They can discover the meaning of words even without explicit language knowledge, and generalize to novel compositions effortlessly. In this paper, we bring this ability to AI, by studying the task of Visually grounded Language Acquisition (VLA). We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning, which approximates novel compositions by learning semantic mapping and reasoning operations from previously seen compositions. Our proposed method, Analogical Reasoning Transformer Networks (ARTNet), is trained on raw multimedia data (video frames and transcripts), and after observing a set of compositions such as "washing apple" or "cutting carrot", it can generalize and recognize new compositions in new video frames, such as "washing…
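As a rough illustration of what approximating a novel composition from previously seen ones could look like, the sketch below estimates an embedding for the unseen pair "washing carrot" from three seen pairs by simple vector arithmetic (a parallelogram analogy). This is a swapped-in toy technique, not ARTNet's learned semantic mapping and reasoning operations; the embedding values, the `SEEN` table, and the function name `approximate_by_analogy` are all hypothetical.

```python
# Toy 3-d embeddings for previously seen verb-noun compositions.
# The values are made up for illustration only.
SEEN = {
    ("washing", "apple"): [1.0, 0.0, 1.0],
    ("cutting", "apple"): [0.0, 1.0, 1.0],
    ("cutting", "carrot"): [0.0, 1.0, -1.0],
}


def approximate_by_analogy(verb, noun, seen):
    """Estimate an embedding for an unseen (verb, noun) pair.

    Looks for a seen "bridge" pair (v2, n2) that shares neither word with
    the target, such that (verb, n2) and (v2, noun) were also seen, and
    returns (verb, n2) - (v2, n2) + (v2, noun). Returns None when no such
    triple of seen compositions exists.
    """
    for (v2, n2), bridge in seen.items():
        if v2 == verb or n2 == noun:
            continue  # the bridge must differ in both verb and noun
        same_verb = seen.get((verb, n2))
        same_noun = seen.get((v2, noun))
        if same_verb is not None and same_noun is not None:
            return [a - b + c for a, b, c in zip(same_verb, bridge, same_noun)]
    return None


# "washing carrot" ~ "washing apple" - "cutting apple" + "cutting carrot"
estimate = approximate_by_analogy("washing", "carrot", SEEN)
```

In a learned system the arithmetic would presumably be replaced by trained operations over retrieved compositions; the fixed subtraction and addition here only conveys the intuition.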
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Residual Connection · Adam · Multi-Head Attention · Dropout · Softmax · Label Smoothing
