Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Yizhen Zhang; Minkyu Choi; Kuan Han; Zhongming Liu

arXiv:2111.07180 · cs.CL · November 16, 2021 · 1 citation

TL;DR

This paper introduces a vision-grounded language model that learns semantic representations aligned with visual perception, enabling explainable, perceptually meaningful embeddings and improved multimodal understanding.

Contribution

The paper presents a two-stream model, with a VGG-based visual stream and a BERT-based language stream, trained with cross-modal contrastive learning to ground language in vision. The resulting joint semantic space is interpretable and supports multimodal image and text search.

Findings

01 Semantic space aligns with human intuition

02 Word embeddings predict semantic norms

03 Enables multimodal image and text search
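
Multimodal search in a joint embedding space usually comes down to ranking candidates by their similarity to a query. The sketch below illustrates that idea with cosine similarity in NumPy. It is not the paper's retrieval pipeline: the function name `rank_by_cosine` and the toy 3-dimensional vectors are hypothetical, and real embeddings would come from the trained visual and language streams.

```python
import numpy as np

def rank_by_cosine(query_emb, candidate_embs):
    """Rank candidates by cosine similarity to a query (hypothetical helper)."""
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q
    # Sort candidate indices from most to least similar.
    order = np.argsort(-scores)
    return order, scores[order]

# Toy stand-ins for image embeddings and a text-query embedding (assumed values).
image_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
text_query = np.array([0.9, 0.1, 0.0])

order, scores = rank_by_cosine(text_query, image_embs)
print(order, scores)
```

The same function works in the other direction (an image embedding as the query, text embeddings as candidates), since both modalities are assumed to live in one shared space.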

Abstract

In natural language processing, most models learn semantic representations from text alone. The learned representations encode distributional semantics but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action, and the brain encodes grounded semantics for cognition. Inspired by this notion and by recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module and…
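
The abstract does not spell out the contrastive objective, so the following is only a generic illustration of cross-modal contrastive alignment: a symmetric InfoNCE-style loss over a batch of paired image and text embeddings, written in NumPy. The function names, the temperature value, and the toy inputs are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Generic symmetric InfoNCE loss; row i of each input is a matched pair.

    The temperature of 0.1 is an assumed value, not one reported in the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick its own caption among all captions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should pick its own image among all images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check: matched pairs give a low loss, shuffled pairs a high one.
pairs = np.eye(4)
aligned_loss = symmetric_contrastive_loss(pairs, pairs)
shuffled_loss = symmetric_contrastive_loss(pairs, np.roll(pairs, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

Minimizing a loss of this kind pulls matched image and text embeddings together and pushes mismatched pairs apart, which is the general mechanism by which two streams can come to share one representational space.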

Videos

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques