VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception
Zhaoliang Wan, Yonggen Ling, Senlin Yi, Lu Qi, Wangwei Lee, Minglei Lu, Sicheng Yang, Xiao Teng, Peng Lu, Xu Yang, Ming-Hsuan Yang, Hui Cheng

TL;DR
VinT-6D introduces a large-scale, multi-modal dataset combining vision, touch, and proprioception for robotic object-in-hand pose estimation, facilitating improved manipulation models and bridging the simulation-to-real gap.
Contribution
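The paper presents VinT-6D, described by the authors as the first extensive multi-modal dataset integrating vision, touch, and proprioception for object-in-hand pose estimation. It includes what the authors believe is the largest real-world split of its kind, along with a benchmark method that demonstrates the benefits of multi-modal fusion.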
Findings
VinT-6D contains 2 million simulated and 0.1 million real-world data points.
A benchmark method with multi-modal fusion significantly improves pose estimation performance.
VinT-6D bridges the simulation-to-real gap in robotic perception datasets.
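To put the two split sizes in proportion, here is a minimal sketch that computes each split's share of the total. The dictionary and function names are illustrative assumptions, not part of any VinT-6D tooling, and the counts are the rounded figures quoted on this page.

```python
# Split sizes as quoted on this page (rounded figures, not exact counts).
SPLIT_SIZES = {
    "VinT-Sim": 2_000_000,   # simulated data points
    "VinT-Real": 100_000,    # real-world data points
}


def split_fractions(sizes):
    """Return each split's share of the total number of data points."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


fractions = split_fractions(SPLIT_SIZES)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

With these rounded counts, the real-world split is roughly one twenty-first of the full dataset, which gives a sense of how much more expensive real-world collection is than simulation.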
Abstract
This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises two splits: VinT-Sim, with 2 million samples collected via simulation in MuJoCo and Blender, and VinT-Real, with 0.1 million samples collected on a custom-designed real-world platform. The dataset is tailored for robotic hands, offering models with whole-hand tactile perception and high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest real-world split of its kind, given the difficulty of collecting data in real-world environments, and it helps bridge the simulation-to-real gap relative to previous work. Built upon VinT-6D, we present…
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Video Surveillance and Tracking Methods
