VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception

Zhaoliang Wan; Yonggen Ling; Senlin Yi; Lu Qi; Wangwei Lee; Minglei Lu; Sicheng Yang; Xiao Teng; Peng Lu; Xu Yang; Ming-Hsuan Yang; Hui Cheng

arXiv:2501.00510 · cs.RO · January 7, 2025

TL;DR

VinT-6D introduces a large-scale, multi-modal dataset combining vision, touch, and proprioception for robotic object-in-hand pose estimation, facilitating improved manipulation models and bridging the simulation-to-real gap.

Contribution

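The paper presents VinT-6D, described by the authors as the first extensive multi-modal dataset that integrates vision, touch, and proprioception for robotic in-hand manipulation. It includes VinT-Real, which the authors state is, to their knowledge, the largest real-world split of its kind, and a benchmark method that demonstrates the benefits of multi-modal fusion.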

Findings

01. VinT-6D contains 2 million simulated (VinT-Sim) and 0.1 million real-world (VinT-Real) samples.

02. A benchmark method with multi-modal fusion significantly improves pose estimation performance.

03. VinT-6D is intended to help bridge the simulation-to-real gap in robotic perception datasets.

Abstract

This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises 2 million samples in the VinT-Sim split and 0.1 million in the VinT-Real split, collected via simulations in MuJoCo and Blender and on a custom-designed real-world platform, respectively. The dataset is tailored for robotic hands, offering models with whole-hand tactile perception and high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest of its kind, which is notable given the difficulty of collecting data in real-world environments, so it can help bridge the simulation-to-real gap compared with previous works. Built upon VinT-6D, we present…
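To make the split sizes concrete, the following is a minimal Python sketch that computes each split's share of the total from the figures quoted in the abstract. The `InHandSample` container and its field names are purely hypothetical illustrations of a vision, touch, and proprioception record; they are assumptions and do not reflect the dataset's actual schema or file format.

```python
from dataclasses import dataclass
from typing import Dict, List

# Split sizes as quoted in the abstract (rounded figures, not exact counts).
SPLIT_SIZES: Dict[str, int] = {"VinT-Sim": 2_000_000, "VinT-Real": 100_000}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


@dataclass
class InHandSample:
    """Hypothetical record; field names are assumptions, not the real schema."""
    vision: List[float]          # e.g. flattened image features (assumed)
    touch: List[float]           # e.g. whole-hand tactile readings (assumed)
    proprioception: List[float]  # e.g. hand joint angles (assumed)
    object_pose: List[float]     # 6D pose, here [x, y, z, roll, pitch, yaw] (assumed)


fractions = split_fractions(SPLIT_SIZES)
for name, share in fractions.items():
    print(f"{name}: {share:.1%} of all samples")
```

Under these rounded figures, the real-world split is a little under 5% of the total, which illustrates why a comparatively large real-world split matters for sim-to-real evaluation.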

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Video Surveillance and Tracking Methods