Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy

Willow Mandil; Amir Ghalamzan-E

arXiv:2304.11193·cs.RO·May 14, 2026·1 cites

TL;DR

This paper develops a multi-modal world model integrating visual and tactile data to improve robotic interaction predictions, especially under physical ambiguity, and introduces two new datasets for evaluation.

Contribution

The paper presents a visuo-tactile prediction system and two new robot-pushing datasets collected with a magnetic-based tactile sensor, and it examines when tactile feedback improves predictions of physical robot interaction.

Findings

01 Visuo-tactile prediction enhances accuracy in ambiguous interactions.

02 Tactile data provides limited benefits when object dynamics are visually clear.

03 New datasets isolate physical ambiguity and mirror existing benchmarks.

Abstract

Predicting the outcomes of robotic actions, often referred to as learning a world model, in complex environments remains a fundamental challenge in robotics. Existing approaches primarily rely on visual observations and action inputs to generate video-based predictions, frequently overlooking the critical role of tactile feedback in understanding physical interactions. In this work, we investigate the integration of tactile and visual information within predictive perception systems for physical robot interaction. We demonstrate that visuo-tactile prediction provides the greatest benefits in physically ambiguous interaction regimes, while improvements are naturally limited when object dynamics are visually inferable. Furthermore, we introduce two novel robot-pushing datasets collected using a magnetic-based tactile sensor for unsupervised learning. The first dataset comprises visually…
