VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft
Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, Hao Wang

TL;DR
VistaWise is a cost-effective agent framework for Minecraft tasks that combines a cross-modal knowledge graph with a small amount of domain-specific data, cutting the training data required from millions of samples to a few hundred.
Contribution
The paper introduces VistaWise, a novel framework integrating cross-modal knowledge graphs and a retrieval strategy, enabling high-performance Minecraft agents with minimal domain-specific training data.
Findings
Achieves state-of-the-art performance in open-world Minecraft tasks.
Reduces domain-specific training data requirements from millions to hundreds.
Demonstrates effective multimodal understanding and decision-making.
Abstract
Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the…
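To make the idea of a cross-modal knowledge graph with task-related retrieval concrete, here is a minimal toy sketch. It is not the paper's implementation: the class `CrossModalKG`, its methods, the item names, and the breadth-first retrieval are all assumptions chosen for illustration, and the paper's actual retrieval-based pooling strategy may work quite differently. The sketch stores textual dependencies (which item requires which) alongside visual detections (object label and screen position), then retrieves only the part of the graph reachable from a goal item.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class CrossModalKG:
    """Hypothetical toy graph: textual dependency edges plus visual detections."""

    # Textual edges: item -> list of items it depends on.
    requires: dict[str, list[str]] = field(default_factory=dict)
    # Visual nodes: object label -> positions reported by some detector.
    seen: dict[str, list[tuple[int, int]]] = field(default_factory=dict)

    def add_dependency(self, item: str, prerequisite: str) -> None:
        self.requires.setdefault(item, []).append(prerequisite)

    def add_detection(self, label: str, position: tuple[int, int]) -> None:
        self.seen.setdefault(label, []).append(position)

    def retrieve(self, goal: str) -> dict[str, list[tuple[int, int]]]:
        """Walk dependency edges breadth-first from `goal`.

        Returns every item reachable from the goal, mapped to the
        positions where that item is currently detected (possibly none).
        """
        result: dict[str, list[tuple[int, int]]] = {}
        queue = deque([goal])
        while queue:
            item = queue.popleft()
            if item in result:
                continue  # already visited
            result[item] = self.seen.get(item, [])
            queue.extend(self.requires.get(item, []))
        return result


# Assumed toy data, not taken from the paper.
kg = CrossModalKG()
kg.add_dependency("wooden_pickaxe", "planks")
kg.add_dependency("wooden_pickaxe", "stick")
kg.add_dependency("planks", "log")
kg.add_dependency("stick", "planks")
kg.add_dependency("torch", "coal")   # unrelated to the goal below
kg.add_detection("log", (120, 80))
kg.add_detection("cow", (30, 40))    # visible but irrelevant to the goal

context = kg.retrieve("wooden_pickaxe")
print(context)
```

In this sketch, only the items on the dependency path of the goal reach the retrieved context, so a language model prompted with `context` would not see the unrelated `torch`, `coal`, or `cow` entries.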
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
