MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation
Haoming Wang, Qiyao Xue, Weichen Liu, Wei Gao

TL;DR
MosaicThinker is a novel on-device inference technique that improves visual spatial reasoning in embodied AI by constructing a unified spatial map from multiple video frames, enabling better cross-frame reasoning.
Contribution
The paper introduces MosaicThinker, a new method that enhances small VLMs' spatial reasoning by integrating multi-frame spatial information into a global semantic map for embodied AI.
Findings
Significantly improves cross-frame spatial reasoning accuracy.
Effective on resource-constrained embodied AI devices.
Handles diverse and complex spatial reasoning tasks.
Abstract
As embodied AI expands from traditional object detection and recognition to more advanced tasks such as robot manipulation and actuation planning, visual spatial reasoning over video inputs becomes necessary to perceive the spatial relationships among objects and guide device actions. However, existing visual language models (VLMs) are weak at spatial reasoning because they lack knowledge of 3D spatial information, especially when the reasoning task involves complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker, which enhances the on-device small VLM's spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation of…
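The basic idea stated in the abstract, merging per-frame fragments into one shared map so that relations between objects never seen together can be queried, can be illustrated with a toy sketch. This is not the authors' implementation: the 2D setting, the known camera poses, and every name below (`FrameObservation`, `to_global`, `build_global_map`, `distance`) are hypothetical assumptions made only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class FrameObservation:
    """Objects seen in one frame (hypothetical structure, not from the paper)."""
    cam_x: float    # camera position in the global map, assumed known
    cam_y: float
    cam_yaw: float  # camera heading in radians, counter-clockwise from +x
    objects: dict   # label -> (forward, left) offset from the camera, in metres


def to_global(obs, forward, left):
    """Rotate a camera-relative offset by the camera yaw, then translate it."""
    c, s = math.cos(obs.cam_yaw), math.sin(obs.cam_yaw)
    gx = obs.cam_x + c * forward - s * left
    gy = obs.cam_y + s * forward + c * left
    return gx, gy


def build_global_map(frames):
    """Merge all frames into one label -> (x, y) map, averaging repeated sightings."""
    sums = {}
    for obs in frames:
        for label, (forward, left) in obs.objects.items():
            gx, gy = to_global(obs, forward, left)
            sx, sy, n = sums.get(label, (0.0, 0.0, 0))
            sums[label] = (sx + gx, sy + gy, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def distance(global_map, a, b):
    """Euclidean distance between two objects in the merged map."""
    (ax, ay), (bx, by) = global_map[a], global_map[b]
    return math.hypot(ax - bx, ay - by)


# Two frames from the same spot, the second rotated 90 degrees to the left.
# The chair appears only in frame 1 and the lamp only in frame 2; the table
# is visible in both and lands at the same global position each time.
frames = [
    FrameObservation(0.0, 0.0, 0.0, {"chair": (2.0, 0.0), "table": (2.0, 1.0)}),
    FrameObservation(0.0, 0.0, math.pi / 2, {"lamp": (3.0, 0.0), "table": (1.0, -2.0)}),
]
global_map = build_global_map(frames)
print(f"chair-lamp distance: {distance(global_map, 'chair', 'lamp'):.2f} m")
```

Once every observation lives in one coordinate frame, a cross-frame question such as "how far is the chair from the lamp?" reduces to a lookup in the merged map, even though no single frame contains both objects.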
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Ferroelectric and Negative Capacitance Devices
