AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

Sixiang Chen; Jiaming Liu; Siyuan Qian; Han Jiang; Lily Li; Renrui Zhang; Zhuoyang Liu; Chenyang Gu; Chengkai Hou; Pengwei Wang; Zhongyuan Wang; Shanghang Zhang

arXiv:2507.01961 · cs.RO · July 8, 2025

TL;DR

AC-DiT is a diffusion transformer for mobile manipulation that explicitly models how the mobile base affects manipulator control and fuses multimodal visual inputs according to the task stage, with the aim of better base-manipulator coordination.

Contribution

The paper proposes AC-DiT, a transformer-based framework that explicitly models the influence of the mobile base on manipulator control and adaptively fuses multimodal perception for mobile manipulation.

Findings

01. Enhanced coordination between base and manipulator.

02. Dynamic multimodal perception improves task performance.

03. Validated on simulated and real-world tasks.

Abstract

Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages during mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the…

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications

Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer · Diffusion · Balanced Selection