AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation
Sixiang Chen, Jiaming Liu, Siyuan Qian, Han Jiang, Lily Li, Renrui Zhang, Zhuoyang Liu, Chenyang Gu, Chengkai Hou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

TL;DR
AC-DiT introduces an adaptive transformer model that improves mobile manipulation by explicitly modeling base-manipulator coordination and dynamically fusing multimodal visual data based on task stage, enhancing robotic control.
Contribution
The paper proposes AC-DiT, a novel transformer-based framework that explicitly models mobile base influence and adaptively fuses multimodal perception for improved mobile manipulation.
Findings
Enhanced coordination between base and manipulator.
Dynamic multimodal perception improves task performance.
Validated on simulated and real-world tasks.
Abstract
Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages during mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRobot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications
MethodsDropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer · Diffusion · Balanced Selection
