VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings
Athanasios Efthymiou, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

TL;DR
VL-KGE leverages vision-language models to create unified, multimodal embeddings for knowledge graphs, significantly improving link prediction by better aligning diverse modalities.
Contribution
This paper introduces VL-KGE, a novel framework that combines VLMs with structured relational modeling for enhanced multimodal knowledge graph embeddings.
Findings
VL-KGE outperforms traditional KGE methods in link prediction.
Experiments on WN9-IMG and new art MKGs show improved multimodal reasoning.
VLMs effectively align diverse modalities within knowledge graphs.
Abstract
Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge…
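One way to picture the non-uniform modality availability mentioned in the abstract is the following minimal sketch. It assumes each entity's image and text vectors have already been produced by a VLM in a shared embedding space, and it averages whichever modalities are present, falling back to a structure-only embedding when none is. The fusion rule and the names `fuse_entity` and `fallback_vec` are assumptions for illustration; the abstract does not specify how VL-KGE combines modalities.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fuse_entity(image_vec=None, text_vec=None, fallback_vec=None):
    """Average the normalized modality vectors that are available for an entity.

    Assumption: image_vec and text_vec live in the same VLM embedding space.
    fallback_vec is a hypothetical structure-only embedding used when the
    entity has neither an image nor a text description.
    """
    available = [l2_normalize(v) for v in (image_vec, text_vec) if v is not None]
    if not available:
        if fallback_vec is None:
            raise ValueError("entity has no modality and no fallback embedding")
        return list(fallback_vec)
    dim = len(available[0])
    return [sum(v[i] for v in available) / len(available) for i in range(dim)]

image_only = fuse_entity(image_vec=[3.0, 4.0])                   # only an image is available
both = fuse_entity(image_vec=[1.0, 0.0], text_vec=[0.0, 1.0])    # image and text
neither = fuse_entity(fallback_vec=[0.1, 0.2])                   # no modality at all
```

The fused vector can then be passed to any relational scoring function in place of a purely structural entity embedding.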
Taxonomy
Topics: Advanced Graph Neural Networks · Multimodal Machine Learning Applications · Topic Modeling
