Like a bilingual baby: The advantage of visually grounding a bilingual language model
Khai-Nguyen Nguyen, Zixin Tang, Ankur Mali, Alex Kelly

TL;DR
This paper shows that visually grounding a bilingual language model improves its grasp of semantic similarity and lowers its perplexity. The benefit appears for concrete words but not significantly for abstract ones, which highlights the value of multi-sensory data in multilingual NLP.
Contribution
It introduces a visually grounded bilingual LSTM language model trained on English and Spanish image-caption data from MS-COCO-ES, and shows that it outperforms non-grounded models on semantic similarity and perplexity.
Findings
Visual grounding improves semantic similarity within and across languages.
Grounded models show lower perplexity than non-grounded models.
Visual grounding shows no significant benefit for abstract words.
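For readers unfamiliar with the perplexity metric used in these findings, the following is a minimal sketch of the standard definition: the exponential of the average negative log-probability a model assigns to each token. The function name `perplexity` and the toy probabilities are illustrative assumptions, not taken from the paper, and the paper's exact evaluation setup may differ.

```python
import math


def perplexity(token_probs):
    """Return the perplexity of a sequence given per-token probabilities.

    token_probs: probabilities (each in (0, 1]) that a language model
    assigned to the tokens that actually occurred. Hypothetical input;
    a real model would produce these from its softmax output.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    # Average negative log-likelihood per token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)


# Toy comparison: a model that assigns higher probability to the
# observed tokens gets a lower (better) perplexity.
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.5, 0.5, 0.5, 0.5])
```

A lower value means the model is less "surprised" by the held-out text, which is the sense in which the grounded models are reported as better.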
Abstract
Unlike most neural language models, humans learn language in a rich, multi-sensory and, often, multi-lingual environment. Current language models typically fail to fully capture the complexities of multilingual language use. We train an LSTM language model on images and captions in English and Spanish from MS-COCO-ES. We find that the visual grounding improves the model's understanding of semantic similarity both within and across languages and improves perplexity. However, we find no significant advantage of visual grounding for abstract words. Our results provide additional evidence of the advantages of visually grounded language models and point to the need for more naturalistic language data from multilingual speakers and multilingual datasets with perceptual grounding.
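Semantic similarity between word representations is commonly scored with cosine similarity. The sketch below illustrates that general idea only: the function `cosine_similarity` and the toy vectors are hypothetical, and this summary does not state which similarity measure or benchmark the authors used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, invented for illustration only.
# In a bilingual model, a translation pair such as "dog"/"perro" would
# ideally lie closer together than an unrelated pair.
emb_dog = [0.9, 0.1, 0.2]
emb_perro = [0.8, 0.2, 0.1]
emb_idea = [0.1, 0.9, 0.7]

cross_lingual = cosine_similarity(emb_dog, emb_perro)
unrelated = cosine_similarity(emb_dog, emb_idea)
```

Comparing such scores against human similarity judgments is one common way to test whether a model's representations capture meaning within and across languages.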
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
