VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems
Ethan Smyth, Alessandro Suglia

TL;DR
VoyagerVision explores the integration of visual inputs into open-ended learning systems, demonstrating that multi-modal data enhances the ability to interpret environments and generate complex structures in Minecraft, advancing towards more capable AGI models.
Contribution
This paper introduces VoyagerVision, a novel multi-modal model that uses visual feedback to improve open-ended task performance in environment construction, extending previous models like Voyager.
Findings
VoyagerVision created an average of 2.75 structures within fifty iterations.
It succeeded in half of the building unit tests in flat worlds.
Most failures occurred in complex structures.
Abstract
Open-endedness is an active field of research in the pursuit of capable Artificial General Intelligence (AGI), allowing models to pursue tasks of their own choosing. Simultaneously, recent advancements in Large Language Models (LLMs) such as GPT-4o [9] have allowed such models to be capable of interpreting image inputs. Implementations such as OMNI-EPIC [4] have made use of such features, providing an LLM with pixel data of an agent's POV to parse the environment and allow it to solve tasks. This paper proposes that providing these visual inputs to a model gives it greater ability to interpret spatial environments, and as such, can increase the number of tasks it can successfully perform, extending its open-ended potential. To this aim, this paper proposes VoyagerVision -- a multi-modal model capable of creating structures within Minecraft using screenshots as a form of visual feedback,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Artificial Intelligence Applications
