How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing
Shutong Jin, Ruiyu Wang, Muhammad Zahid, Florian T. Pokorny

TL;DR
This paper investigates how physics attributes and background scene characteristics affect the performance of Video Transformers in robotic planar pushing, using a new large dataset and a modular prediction framework.
Contribution
The paper introduces CloudGripper-Push-1K, a large real-world dataset, and proposes the Video Occlusion Transformer (VOT) framework for trajectory prediction in robotic manipulation.
Findings
Physics and background attributes significantly impact model performance.
Certain attribute changes are more detrimental to generalization.
Minimal fine-tuning data can adapt models to new scenarios.
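The fine-tuning question can be made concrete with a small sketch. The code below is not the authors' pipeline. It only illustrates one way to draw reproducible subsets of a novel-scenario dataset at several proportions, so that a model could be fine-tuned on each and compared. The helper name `sample_finetune_subset`, the pool of 1000 episode IDs, and the fractions are all hypothetical and chosen for illustration.

```python
import random


def sample_finetune_subset(episodes, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of the episodes.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same subset
    k = max(1, round(fraction * len(episodes)))  # keep at least one episode
    return rng.sample(list(episodes), k)


# Assumed pool of episode IDs recorded in a novel scenario (size is illustrative).
novel_episodes = list(range(1000))

# Assumed fine-tuning proportions to compare.
fractions = (0.01, 0.05, 0.1, 0.5)
subsets = {f: sample_finetune_subset(novel_episodes, f) for f in fractions}
```

Each entry in `subsets` would then be used as the fine-tuning set for one run, with evaluation on held-out episodes from the same novel scenario.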
Abstract
As model and dataset sizes continue to scale in robot learning, the need to understand how the composition and properties of a dataset affect model performance becomes increasingly urgent to ensure cost-effective data collection and model performance. In this work, we empirically investigate how physics attributes (color, friction coefficient, shape) and scene background characteristics, such as the complexity and dynamics of interactions with background objects, influence the performance of Video Transformers in predicting planar pushing trajectories. We investigate three primary questions: How do physics attributes and background scene characteristics influence model performance? What kind of changes in attributes are most detrimental to model generalization? What proportion of fine-tuning data is required to adapt models to novel scenarios? To facilitate this research, we present…
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Advanced Vision and Imaging
Methods: Multi-Head Attention · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Attention Is All You Need · Adam · Residual Connection · Layer Normalization · Softmax
