World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
Ziqiao Ma, Jiayi Pan, Joyce Chai

TL;DR
This paper introduces OctoBERT, a visually grounded language model pre-trained on image-text pairs with grounding as an objective, and shows that it learns grounded word meanings faster and more coherently than baselines, including for unseen words in open-vocabulary settings.
Contribution
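The paper introduces Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning, and proposes OctoBERT, a novel pre-trained vision-language model that improves grounded word learning and the bootstrapping of unseen words.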
Findings
OctoBERT outperforms baselines in grounded word learning.
Grounding during pre-training accelerates unseen word acquisition.
OctoBERT demonstrates more coherent and rapid open vocabulary learning.
Abstract
The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with their grounded meanings and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose object-oriented BERT (OctoBERT), a novel visually-grounded language model by pre-training on image-text pairs highlighting grounding as an objective. Through extensive experiments and analysis, we demonstrate that OctoBERT is a more coherent and fast grounded word learner, and that the grounding ability acquired…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Weight Decay · Residual Connection · Softmax · Adam · Dropout
