How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing

Shutong Jin; Ruiyu Wang; Muhammad Zahid; Florian T. Pokorny

arXiv:2310.02044·cs.RO·August 29, 2024

TL;DR

This paper investigates how physics attributes and background scene characteristics affect the performance of Video Transformers in robotic planar pushing, using a new large dataset and a modular prediction framework.

Contribution

The paper introduces CloudGripper-Push-1K, a large real-world dataset, and proposes the Video Occlusion Transformer (VOT) framework for trajectory prediction in robotic manipulation.

Findings

1. Physics and background attributes significantly impact model performance.

2. Certain attribute changes are more detrimental to generalization.

3. Minimal fine-tuning data can adapt models to new scenarios.

Abstract

As model and dataset sizes continue to scale in robot learning, the need to understand how the composition and properties of a dataset affect model performance becomes increasingly urgent to ensure cost-effective data collection and model performance. In this work, we empirically investigate how physics attributes (color, friction coefficient, shape) and scene background characteristics, such as the complexity and dynamics of interactions with background objects, influence the performance of Video Transformers in predicting planar pushing trajectories. We investigate three primary questions: How do physics attributes and background scene characteristics influence model performance? What kind of changes in attributes are most detrimental to model generalization? What proportion of fine-tuning data is required to adapt models to novel scenarios? To facilitate this research, we present…

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Advanced Vision and Imaging

Methods: Multi-Head Attention · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Attention Is All You Need · Adam · Residual Connection · Layer Normalization · Softmax