RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Feng Yan, Fanfan Liu, Liming Zheng, Yufeng Zhong, Yiyang Huang, Zechao Guan, Chengjian Feng, Lin Ma

TL;DR
RoboTron-Mani is a multimodal large model for robotic manipulation. Trained with the new dataset RoboData, it improves 3D perception and modality fusion and achieves state-of-the-art results across diverse tasks.
Contribution
The paper introduces RoboTron-Mani, a novel multimodal model with improved 3D perception and modality fusion, and RoboData, a comprehensive dataset integrating multiple robotic data sources.
Findings
Outperforms expert models on manipulation tasks.
Increases average sequence length on CALVIN from 1.7 to 3.5.
Achieves state-of-the-art results on simulated and real-world datasets.
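For context on the CALVIN figure above: the benchmark's long-horizon evaluation is commonly reported as the average number of consecutive tasks completed per chain of five instructions, counting from the first task and stopping at the first failure. The sketch below shows that computation under this assumption. The function name and the rollout outcomes are hypothetical and are not results from the paper.

```python
def average_sequence_length(chains):
    """Mean count of consecutive successes from the start of each chain.

    `chains` is a list of per-chain outcome lists (True = task succeeded).
    Counting stops at the first failure in a chain.
    """
    if not chains:
        raise ValueError("need at least one chain")
    total = 0
    for chain in chains:
        completed = 0
        for success in chain:
            if not success:
                break
            completed += 1
        total += completed
    return total / len(chains)


# Hypothetical outcomes for three 5-task instruction chains.
rollouts = [
    [True, True, True, False, True],   # 3 tasks completed before first failure
    [True, True, True, True, True],    # all 5 completed
    [False, True, True, True, True],   # fails immediately, counts as 0
]
avg_len = average_sequence_length(rollouts)
print(avg_len)
```

Under this metric the maximum score is 5, so a move from 1.7 to 3.5 means the model completes roughly twice as many tasks in a row before its first failure.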
Abstract
Recently, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboTron-Mani and the comprehensive dataset RoboData. RoboTron-Mani, on one hand, enhances 3D perception through camera parameters and occupancy supervision. On the other hand, it further incorporates Modality-Isolation-Mask and multimodal decoder blocks based on OpenFlamingo, improving modality fusion and fine-grained perception. RoboData integrates several publicly available datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, actions, and space alignment, which facilitates comprehensive learning from diverse robotic datasets and offers one complete…
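As a rough illustration of why camera parameters and depth maps matter for aligning multi-view data in a shared space, the sketch below back-projects one pixel into 3D with the standard pinhole camera model and then moves it into a common world frame. This is textbook geometry, not the paper's pipeline. The function names, intrinsics (`fx`, `fy`, `cx`, `cy`), and the camera-to-world matrix are all assumptions made for illustration.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth.

    Returns the 3D point in the camera frame, in the same units as depth.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(p_cam, cam_to_world):
    """Apply a 4x4 camera-to-world transform (row-major nested lists)."""
    x, y, z = p_cam
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_world[:3]
    )


# Hypothetical intrinsics for a 640x480 camera.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

# Hypothetical extrinsics: camera shifted 1 unit along world x, no rotation.
cam_to_world = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

p_cam = backproject(320, 240, 2.0, fx, fy, cx, cy)  # pixel at the image center
p_world = to_world(p_cam, cam_to_world)
print(p_cam, p_world)
```

Points from different cameras or datasets become comparable once each is mapped through its own intrinsics and extrinsics into one frame, which is the general idea behind fusing heterogeneous robot data with known camera parameters.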
Taxonomy
Topics: Robot Manipulation and Learning
