Robotic grasp detection based on Transformer
Mingshuai Dong, Xiuli Yu

TL;DR
This paper introduces a Transformer-based encoder-decoder model for robotic grasp detection. A Transformer encoder extracts global context and a fully convolutional decoder supplies the convolutional inductive bias; the model targets cluttered scenes and reaches 98.1% accuracy on the Cornell Grasp dataset.
Contribution
The paper proposes a novel encoder-decoder grasp detection model that integrates Transformer and convolutional networks to improve performance in cluttered environments.
Findings
Outperforms existing methods in overlapping object scenes
Achieves 98.1% accuracy on the Cornell Grasp dataset
Demonstrates effectiveness of combining Transformer with CNNs
Abstract
Grasp detection in cluttered environments remains a major challenge for robots. The Transformer mechanism has recently been applied successfully to visual tasks, and its strong ability to extract global context information offers a feasible way to improve robotic grasp detection in cluttered scenes. However, the original Transformer model has a weak inductive bias, so it must be trained on large-scale datasets, which are difficult to obtain for grasp detection. In this paper, we propose a grasp detection model based on an encoder-decoder structure. The encoder uses a Transformer network to extract global context information. The decoder uses a fully convolutional neural network to strengthen the model's inductive bias and combines the features extracted by the encoder to predict the final grasp configuration. Experiments on the VMRD dataset…
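The contrast the abstract draws between global context (Transformer encoder) and convolutional inductive bias (decoder) can be illustrated with a toy sketch. This is not the authors' model. It assumes a 1D sequence of tokens, single-head attention with identity query/key/value projections, and a fixed 3-tap kernel, and every function name is hypothetical. It only shows which output positions react when one input token is perturbed.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: identity Q/K/V projections, to keep the sketch minimal.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def conv1d(tokens, kernel=(0.25, 0.5, 0.25)):
    # 3-tap convolution along the token axis, zero padded, applied per channel.
    # Assumption: a fixed smoothing kernel stands in for learned weights.
    n, d = len(tokens), len(tokens[0])
    half = len(kernel) // 2
    out = []
    for i in range(n):
        row = [0.0] * d
        for k, w in enumerate(kernel):
            idx = i + k - half
            if 0 <= idx < n:
                for j in range(d):
                    row[j] += w * tokens[idx][j]
        out.append(row)
    return out

def changed_positions(f, tokens, perturbed, tol=1e-12):
    # Indices of output tokens that differ between the two inputs.
    a, b = f(tokens), f(perturbed)
    return [i for i, (ra, rb) in enumerate(zip(a, b))
            if any(abs(x - y) > tol for x, y in zip(ra, rb))]

# Six toy tokens with two channels each; perturb only the last token.
tokens = [[0.1 * (i + 1), 0.05 * (i + 2)] for i in range(6)]
perturbed = [row[:] for row in tokens]
perturbed[5][0] += 1.0

attn_changed = changed_positions(self_attention, tokens, perturbed)
conv_changed = changed_positions(conv1d, tokens, perturbed)
print("attention outputs affected:", attn_changed)
print("convolution outputs affected:", conv_changed)
```

In this toy setting, perturbing a single token changes every attention output but only the neighbouring convolution outputs. This is the sense in which attention mixes information globally, while convolution builds in a locality prior that needs less data to learn.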
Taxonomy
Topics: Robot Manipulation and Learning · Hand Gesture Recognition Systems · Human Pose and Action Recognition
