Grounded Language Acquisition From Object and Action Imagery
James Robert Kubricht, Zhaoyuan Yang, Jianwei Qiu, Peter Henry Tu

TL;DR
This paper investigates how emergent language models trained on visual data can develop grounded symbols for object and action recognition, using referential and contrastive learning in visual tasks.
Contribution
It introduces a dual training approach for emergent language models in visual recognition, combining referential and contrastive learning with interpretability methods.
Findings
Symbols can be grounded in visual features using the proposed methods.
Gradient-based interpretability reveals semantic regions associated with symbols.
Embeddings show meaningful clustering related to object and action classes.
Abstract
Deep learning approaches to natural language processing have made great strides in recent years. While these models produce symbols that convey vast amounts of diverse knowledge, it is unclear how such symbols are grounded in data from the world. In this paper, we explore the development of a private language for visual data representation by training emergent language (EL) encoders/decoders in both i) a traditional referential game environment and ii) a contrastive learning environment utilizing a within-class matching training paradigm. An additional classification layer utilizing neural machine translation and random forest classification was used to transform symbolic representations (sequences of integer symbols) to class labels. These methods were applied in two experiments focusing on object recognition and action recognition. For object recognition, a set of sketches produced by…
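As a rough illustration of the within-class matching idea in the contrastive setting, the sketch below computes an InfoNCE-style loss in which every other sample sharing the anchor's class label counts as a positive. The abstract does not specify the loss the authors used, so the function name `within_class_contrastive_loss`, the cosine similarity, and the `temperature` parameter are all assumptions made for illustration, not the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def within_class_contrastive_loss(embeddings, labels, temperature=0.1):
    """Hypothetical InfoNCE-style loss (an assumption, not the paper's loss).

    For each anchor, every other sample with the same label is a positive;
    all remaining samples act as negatives in the softmax denominator.
    Returns the mean loss over all (anchor, positive) pairs.
    """
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        # Temperature-scaled similarities from anchor i to every other sample.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in logits.values()))
        for j, s in logits.items():
            if labels[j] == labels[i]:
                total += log_denom - s  # negative log-softmax of the positive
                count += 1
    return total / count if count else 0.0


# Toy 2-D embeddings: the first two point one way, the last two another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = within_class_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = within_class_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy example the loss is small when the labels agree with the geometric clusters and large when they do not, which is the pressure that would push same-class images toward similar representations.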
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Learning
