TL;DR
ST-BiBench is a multi-tier framework for evaluating multimodal bimanual coordination in embodied AI. It reveals gaps between strategic reasoning and physical execution in state-of-the-art MLLMs.
Contribution
The paper presents a benchmarking platform for assessing multi-stream multimodal coordination, highlighting key bottlenecks and a persistent coordination paradox in MLLMs.
Findings
State-of-the-art MLLMs excel at strategic reasoning but struggle with perception-logic alignment.
There is a significant gap between high-level planning and fine-grained physical execution.
Multimodal fusion often suffers from interference and disconnection issues.
Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced the landscape of embodied AI, yet transitioning to synchronized bimanual coordination introduces formidable challenges in multi-stream multimodal integration. We introduce ST-BiBench, a comprehensive multi-tier framework for evaluating spatio-temporal multimodal coordination. Our approach centers on Strategic Coordination Planning, assessing high-level cross-modal reasoning over multiple action and perception streams. To investigate the "proximity paradox" (where semantically coherent plans fail to align with spatially grounded visual inputs), we incorporate Foundational Spatial Grounding to verify workspace awareness and arm-selection logic. Furthermore, we probe model frontiers through Fine-Grained Action Control, investigating whether MLLMs can directly synthesize high-dimensional continuous action modalities (16-Dim)…
