Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
Maximilian Triebel, Marco Menner, Dominik Helfenstein

TL;DR
This paper introduces VLATIM, a new benchmark for evaluating vision-language models on complex physics puzzles that require logical reasoning and precise mouse interactions, and it reveals a significant gap between reasoning and execution in current models.
Contribution
The paper presents VLATIM, a benchmark targeting the gap between logical reasoning and precise action execution in physics puzzle environments.
Findings
Large models excel at planning but lack visual grounding.
Models struggle with precise mouse interactions.
Current models do not exhibit human-like problem-solving.
Abstract
Vision-Language(-Action) Models (VLMs) are increasingly applied to interactive environments, yet existing benchmarks often overlook the complex physical reasoning required for point-and-click puzzle games. This paper introduces Vision-Language Against The Incredible Machine (VLATIM), a benchmark designed to evaluate human-like logical problem-solving capabilities within the classic physics puzzle game The Incredible Machine 2 (TIM). Unlike existing benchmarks, VLATIM specifically targets the critical gap between high-level logical reasoning and continuous action spaces requiring precise mouse interactions. This benchmark is structured into five progressive parts, assessing capabilities that range from basic visual grounding and domain understanding to multi-step manipulation and full puzzle solving. Our results reveal a significant disparity between reasoning and execution. While large…
