VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Hyeongcheol Park; Jiyoung Seo; MinHyuk Jang; Hogun Park; Ha Dam Baek; Gyusam Chang; Hyeonsoo Im; Sangpil Kim

arXiv:2506.21556·cs.CL·September 29, 2025

Open Access · 4 Reviews

TL;DR

VAT-KG is a comprehensive, concept-centric multimodal knowledge graph integrating visual, audio, and text data, designed to enhance retrieval-augmented generation and grounded reasoning in multimodal tasks.

Contribution

We introduce VAT-KG, the first concept-centric multimodal knowledge graph covering visual, audio, and text modalities with detailed concept descriptions, constructed from diverse datasets through a novel alignment pipeline.

Findings

01

VAT-KG improves question answering across modalities.

02

The multimodal RAG framework effectively retrieves detailed concept knowledge.

03

Experiments show VAT-KG enhances MLLMs' grounding and reasoning capabilities.

Abstract

Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval-Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which results in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations restrict applicability to multimodal tasks, particularly as recent MLLMs adopt richer modalities like video and audio. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual,…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. This paper introduces VAT-KG. It is the first concept-centric multimodal knowledge graph that simultaneously covers text, image, audio, and video modalities, providing a potentially influential data resource for future multimodal reasoning research involving video and audio modalities.
2. The four-stage construction pipeline for building VAT-KG is clearly described and reproducible, demonstrating strong system integration and engineering quality. It comprises multimodal alignment filtering,

Weaknesses

See Questions.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The authors clearly identify a central limitation of existing MMKGs: they provide shallow or entity-centric structures and narrow modality coverage, which hinders their utility for retrieval-augmented generation with state-of-the-art MLLMs. VAT-KG’s ambition to unify video, image, audio, and text into a richly described, concept-level knowledge graph fills a visible gap and aligns with unmet needs in the community.
2. The paper introduces a comprehensive and well-justified pipeline that g

Weaknesses

1. I am very concerned about the fairness of the comparison in this paper, since the knowledge contained in VAT-KG is not universal but case-specific. As shown in the sample in Figure 2, the knowledge of an airplane flying over appears only in this picture, while existing MMKG methods usually target general knowledge retrieval scenarios. I think this comparison is not fair.
2. There have been many works on MMKG-related datasets in recent years. The characteristics of the VAT-KG dataset

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The motivation for building a joint visual-audio-text trimodal knowledge graph is meaningful and offers promising prospects for multimodal understanding tasks.
2. By incorporating knowledge from the constructed VAT-KG, the baseline models Video-LLaMA2 and Qwen2.5-Omni achieve modest performance improvements on video, audio, and audio-video understanding tasks.

Weaknesses

1. First, from the results in Table 2, the performance improvement of the proposed method on VCG and AVQA is marginal. In fact, I believe that datasets such as AudioCaps-QA, VCGPT, and AVQA, which rely more on common sense than knowledge, are not well suited for knowledge graph applications. The authors should focus more on knowledge-related video and audio understanding datasets to better leverage the power of knowledge graphs. In addition, a comprehensive comparison with existing state-of-the-

Reviewer 04 · Rating 8 · Confidence 4

Strengths

1. VAT-KG achieves a new level of knowledge coverage.
2. The construction pipeline is carefully designed, with very rigorous filtering rules to ensure the high quality of the final outcome.
3. The experiments show consistent improvement.

Weaknesses

1. (Major) It seems that the entire construction pipeline focuses on refining the text modality, while the data of the other modalities remain the same as what the initial corpus contains. Do the authors plan to crawl additional multimodal data from the web?
2. (Minor) While this MMKG includes both image and video modalities, the paper uses "visual" to denote both of them early on. This might cause some confusion, making readers think there are only three modalities.
3. (Minor) There are two k

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks