RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model

Hantao Zhou; Tianying Ji; Lukas Sommerhalder; Michael Goerner; Norman Hendrich; Jianwei Zhang; Fuchun Sun; Huazhe Xu

arXiv:2406.10157·cs.RO·July 23, 2024·1 cites

Open Access

TL;DR

RoboGolf is a novel framework that integrates vision-language models with multi-modal perception and reflective reasoning to master real-world minigolf tasks, demonstrating advanced embodied intelligence and spatial understanding.

Contribution

It introduces a VLM-based approach combining dual-camera perception, closed-loop action refinement, and a reflective equilibrium loop for real-world minigolf mastery.

Findings

01. Effective in offline inference with recorded trajectories

02. Combines perception, action refinement, and reflective reasoning

03. Demonstrates advanced embodied intelligence in real-world tasks

Abstract

Minigolf is an exemplary real-world game for examining embodied intelligence, requiring challenging spatial and kinodynamic understanding to putt the ball. Additionally, reflective reasoning is required if the feasibility of a challenge is not ensured. We introduce RoboGolf, a VLM-based framework that combines dual-camera perception with closed-loop action refinement, augmented by a reflective equilibrium loop. The core of both loops is powered by finetuned VLMs. We analyze the capabilities of the framework in an offline inference setting, relying on an extensive set of recorded trajectories. Exemplary demonstrations of the analyzed problem domain are available at https://jity16.github.io/RoboGolf/
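The two nested loops named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 1-D course, the toy physics constant, and every name below (`Course`, `execute`, `refine`, `reflect`, `play`) are hypothetical stand-ins, and the simple rules inside `refine` and `reflect` take the place of the finetuned VLMs. The sketch assumes one plausible reading of the abstract, in which an inner loop refines the putt from observed outcomes and an outer reflective loop proposes a change to the course when the challenge appears infeasible.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ROLL_PER_SPEED = 2.0  # toy physics (assumption): metres rolled per unit of putt speed
TOLERANCE = 0.05      # toy success radius around the hole, in metres


@dataclass
class Course:
    hole_distance: float  # metres from the ball to the hole on a toy 1-D course
    max_speed: float      # fastest putt the hypothetical robot can execute


@dataclass
class Outcome:
    success: bool
    error: float  # ball stop position minus hole position


def execute(course: Course, speed: float) -> Outcome:
    """Stand-in for putting and observing the result with the cameras."""
    speed = min(speed, course.max_speed)  # the robot cannot exceed its limit
    error = speed * ROLL_PER_SPEED - course.hole_distance
    return Outcome(abs(error) <= TOLERANCE, error)


def refine(speed: float, outcome: Outcome) -> float:
    """Stand-in for the action-refinement VLM: correct half of the observed error."""
    return speed - 0.5 * outcome.error / ROLL_PER_SPEED


def closed_loop_putt(course: Course, initial_speed: float = 1.0,
                     max_attempts: int = 8) -> Optional[float]:
    """Inner loop: putt, observe, refine, until success or the budget runs out."""
    speed = initial_speed
    for _ in range(max_attempts):
        outcome = execute(course, speed)
        if outcome.success:
            return min(speed, course.max_speed)
        speed = refine(speed, outcome)
    return None


def reflect(course: Course) -> Course:
    """Stand-in for the reflective VLM: move the hole within the robot's reach."""
    reach = course.max_speed * ROLL_PER_SPEED
    return Course(hole_distance=0.75 * reach, max_speed=course.max_speed)


def play(course: Course, max_reflections: int = 2) -> Tuple[Course, Optional[float]]:
    """Outer loop: if the inner loop fails, revise the course and try again."""
    for round_index in range(max_reflections + 1):
        speed = closed_loop_putt(course)
        if speed is not None:
            return course, speed
        if round_index < max_reflections:
            course = reflect(course)
    return course, None
```

Calling `play(Course(hole_distance=10.0, max_speed=2.0))` exercises both loops in this toy setting: the hole is out of reach, so the inner loop exhausts its attempts, `reflect` proposes a closer hole, and the second pass of the inner loop converges.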

Taxonomy

Topics: Augmented Reality Applications

Methods: Sparse Evolutionary Training