ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang

TL;DR
This paper introduces ING-VP, a novel interactive game-based benchmark designed to evaluate the spatial reasoning and multi-step planning abilities of multimodal large language models, revealing current models' significant limitations in these areas.
Contribution
The paper presents ING-VP, the first specialized benchmark for assessing spatial imagination and multi-step reasoning in MLLMs through interactive vision-based games.
Findings
State-of-the-art MLLMs perform poorly on ING-VP, with the best model achieving only 3.37% accuracy.
The benchmark reveals significant gaps in models' multi-step spatial reasoning capabilities.
Multiple experimental settings provide detailed insights into models' reasoning strengths and weaknesses.
Abstract
As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting-edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. To bridge this gap, we present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. ING-VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple…
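The multi-step interaction the abstract describes can be illustrated with a toy evaluation loop. This is only a minimal sketch, not the ING-VP framework: the one-dimensional `TrackLevel` game, the `play_level` and `accuracy` helpers, and both example policies are hypothetical stand-ins (a real setup would query an MLLM with an image or text description of the game state instead of calling a hand-written policy).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TrackLevel:
    """Hypothetical toy level: move a token along a 1-D track to a goal cell."""
    start: int
    goal: int
    length: int


# A policy maps (current position, goal) to an action string.
# In a real benchmark this would be a call to the model under test.
Policy = Callable[[int, int], str]


def play_level(level: TrackLevel, policy: Policy, max_steps: int) -> Tuple[bool, int]:
    """Run one multi-step episode and return (solved, steps_used)."""
    pos = level.start
    for step in range(1, max_steps + 1):
        action = policy(pos, level.goal)
        if action == "left":
            pos = max(0, pos - 1)
        elif action == "right":
            pos = min(level.length - 1, pos + 1)
        # Any other action string is treated as invalid and wastes the step.
        if pos == level.goal:
            return True, step
    return False, max_steps


def accuracy(levels: List[TrackLevel], policy: Policy, max_steps: int) -> float:
    """Fraction of levels solved within the step budget."""
    solved = sum(play_level(lv, policy, max_steps)[0] for lv in levels)
    return solved / len(levels)


def greedy_policy(pos: int, goal: int) -> str:
    """Always step toward the goal."""
    return "right" if pos < goal else "left"


def stuck_policy(pos: int, goal: int) -> str:
    """Ignores the goal entirely, mimicking a model that does not plan."""
    return "left"


LEVELS = [
    TrackLevel(0, 3, 5),
    TrackLevel(4, 1, 5),
    TrackLevel(1, 4, 5),
    TrackLevel(3, 0, 5),
]

if __name__ == "__main__":
    print(accuracy(LEVELS, greedy_policy, max_steps=5))  # 1.0
    print(accuracy(LEVELS, stuck_policy, max_steps=5))   # 0.5
```

The step budget and the per-level success flag are the two pieces that turn a single-shot question into a multi-step planning test: a policy that ignores the goal can still solve levels by accident, which is why accuracy is aggregated over many levels.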
Peer Reviews
Decision: Submitted to ICLR 2025
1. This paper makes a good dataset contribution; the new dataset can serve as one part of the benchmark suite for vision-language models. 2. The idea of this paper is clear and easy to follow. We all agree that MLLMs and vision-language models cannot solve many vision tasks, and it is useful to point out that they struggle with these visual interactive games. 3. This paper tests a large group of open-source and closed-source models.
1. The motivation of your paper is limited, since many papers have already studied MLLMs for visual games and puzzle games, such as [1], [2], [3] (and if you search the keywords [game, puzzle, VQA, VLM] in some database, you will find more works on this topic). I agree your benchmark may have some advantages (like grouping some tasks together) and may be better than these papers in different aspects, but you should make a comparison and highlight these differences. And also I don't th
1. This paper presents an interesting approach by using visual games to evaluate the multi-step reasoning and spatial awareness abilities of MLLMs. 2. This paper provides an insightful analysis of the reasons why MLLMs frequently fail in these games. 3. The paper is well-structured and easy to follow.
1. The paper does not clearly differentiate itself from previous work. For instance, some studies evaluate MLLMs in agent-based environments, such as GUI and robotics, assessing multi-step reasoning and spatial abilities. 2. One of my concerns is the limited scope of the scenarios. The six visual games are pretty specific, and the perceptual or reasoning abilities they test may not adequately represent MLLMs' capabilities in a broader range of real-world scenarios. 3. I noticed the paper does no
1) The paper addresses the important problem of long-term reasoning and multistep planning tasks for multimodal LLMs. The authors describe several multistep reasoning tasks based on different games. 2) The chosen game environments provide a good foundation for evaluating these capabilities of MLLMs. 3) The authors evaluated several state-of-the-art models, both open-source and closed-source, revealing that MLLMs still perform below expectations across these tasks. This research is significant
Some additional experiments that will improve the quality of the paper: 1) The authors appear to use either image-only or text-only inputs for the model. Conducting experiments with both image and text as inputs could further enrich the analysis and provide insights into the model's ability to handle multimodal information effectively. It would be good to include such experiments. 2) For the text-only setup, did you use a language-only model, or did you use a vision-language model (VLM) withou
Taxonomy
Topics: Robotics and Automated Systems
