ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin, and Xiu Li

arXiv:2602.08392·cs.RO·April 7, 2026

TL;DR

ST-BiBench introduces a multi-tier framework to evaluate and analyze the challenges of multimodal coordination in embodied AI, revealing gaps between strategic reasoning and physical execution in state-of-the-art models.

Contribution

It presents a comprehensive benchmarking platform for assessing multi-stream multimodal coordination, highlighting key bottlenecks and the persistent coordination paradox in MLLMs.

Findings

01 State-of-the-art MLLMs excel at strategic reasoning but struggle with perception-logic alignment.

02 There is a significant gap between high-level planning and fine-grained physical execution.

03 Multimodal fusion often suffers from interference and disconnection issues.

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced the landscape of embodied AI, yet transitioning to synchronized bimanual coordination introduces formidable challenges in multi-stream multimodal integration. We introduce ST-BiBench, a comprehensive multi-tier framework for evaluating spatio-temporal multimodal coordination. Our approach centers on Strategic Coordination Planning, assessing high-level cross-modal reasoning over multiple action and perception streams. To investigate the "proximity paradox," in which semantically coherent plans fail to align with spatially grounded visual inputs, we incorporate Foundational Spatial Grounding to verify workspace awareness and arm-selection logic. Furthermore, we probe model frontiers through Fine-Grained Action Control, investigating whether MLLMs can directly synthesize high-dimensional continuous action modalities (16-Dim)…
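
To make the Fine-Grained Action Control setting more concrete, the following is a minimal sketch of how a 16-dimensional action emitted as text by an MLLM might be parsed and checked. Only the dimension count (16) comes from the abstract. The comma-separated reply format, the even 8/8 split between the two arms, and the names `parse_action` and `split_arms` are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import List, Tuple

ACTION_DIM = 16            # stated in the abstract
ARM_DIM = ACTION_DIM // 2  # ASSUMPTION: the vector splits evenly across two arms


def parse_action(reply: str) -> List[float]:
    """Parse a bracketed, comma-separated reply into a 16-dim action vector.

    The reply format is hypothetical; a real benchmark may use another schema.
    """
    cleaned = reply.replace("[", "").replace("]", "")
    values = [float(tok) for tok in cleaned.split(",") if tok.strip()]
    if len(values) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(values)}")
    return values


def split_arms(action: List[float]) -> Tuple[List[float], List[float]]:
    """Split the vector into two per-arm halves (ASSUMED layout: first half, second half)."""
    return action[:ARM_DIM], action[ARM_DIM:]
```

A reply with the wrong number of values raises `ValueError`, which lets an evaluation loop count malformed outputs separately from actions that are well-formed but physically wrong.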

Code & Models

Repositories

bimanibench/BiManiBench (GitHub)
