Visuo-Tactile World Models
Carolina Higuera, Sergio Arnaud, Byron Boots, Mustafa Mukadam, Francois Robert Hogan, Franziska Meier

TL;DR
This paper presents VT-WM, a multi-task visuo-tactile world model that enhances robot-object interaction understanding by integrating tactile sensing with vision, leading to improved physical reasoning, planning, and adaptability in contact-rich tasks.
Contribution
Introduction of VT-WM, a novel multi-task visuo-tactile world model that captures contact physics, improving physical fidelity and planning in contact-rich manipulation tasks.
Findings
33% better object permanence in imagination
29% improved compliance with laws of motion
Up to 35% higher success rates in real-robot tasks
Abstract
We introduce multi-task Visuo-Tactile World Models (VT-WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile sensing, VT-WM better understands robot-object interactions in contact-rich tasks, avoiding common failure modes of vision-only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact-rich manipulation tasks, VT-WM improves physical fidelity in imagination, achieving 33% better performance at maintaining object permanence and 29% better compliance with the laws of motion in autoregressive rollouts. Moreover, experiments show that grounding in contact dynamics also translates to planning. In zero-shot real-robot experiments, VT-WM achieves up to 35% higher success rates, with the largest gains in multi-step,…
Peer Reviews
Decision · Submitted to ICLR 2026
* Novelty: To the best of my knowledge, this paper presents one of the first visuo-tactile world models.
* Thorough experiments: The paper includes experiments on video generation, robotic manipulation, and data efficiency. These comprehensive studies make the work quite solid.
* The explanation for causal compliance is not entirely satisfactory. As shown in Fig. 7, VT-WM appears to understand contact and predicts the cloth to be static. However, in theory, the model could also hallucinate incorrect tactile images and predict contact. Adding a tactile modality does not necessarily guarantee causal compliance. Based on Fig. 7, it is difficult to rule out cherry-picking, and the evidence does not fully establish causal compliance. For example, in Fig. 7, it seems plausib
The paper is well presented, with clear illustrations and a coherent narrative. It is easy to read and effectively conveys the authors’ key ideas. The results are clearly reported.
Minor weakness:
1. Additional visualization of the tactile signals would help readers grasp the Digit 360 modality. In Fig. 2, the four tactile images are difficult to distinguish. Consider including the object’s CAD model and/or overlays highlighting contact regions to clarify how the signals correlate with surface geometry.
2. L139–140: ‘For instance, when manipulating an object in-hand, touch provides context about forces, slip, and subtle pose changes.’ It’s true that humans can interpret
- The paper tackles the important problem of developing physically grounded world models using tactile sensing as an additional signal.
- The experiments in the paper are nicely organized, with an effort to assess the statistical significance of the results.
- The authors also show the ability to perform zero-shot planning with their proposed world model using CEM.
- Further, the authors highlight the data efficiency of training models for real world planning as opposed to specialized task-spe
Including both weaknesses and the questions tied to them below.
- In Section 4.3, what if one trained a multitask BC policy instead of a task-specific policy? I am assuming the authors have actions available from the training data collected for training the world model. I am curious to see if multitask BC training exhibits similar data efficiency.
- I would be curious to see the difference in performance if the world model that takes tactile readings as input but doesn’t predict ta
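For readers unfamiliar with the planner named in the reviews, CEM (the cross-entropy method) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: `toy_world_model`, `cem_plan`, the 2-D point state, the action bounds, and all hyperparameters are hypothetical stand-ins chosen for the example, and a learned visuo-tactile model would take the place of the toy dynamics.

```python
import numpy as np


def toy_world_model(states, actions):
    """Hypothetical stand-in for a learned world model: a point moved by the action."""
    return states + actions


def cem_plan(state, goal, horizon=5, samples=256, elites=32, iters=5, seed=0):
    """Return a (horizon, dim) action sequence found by the cross-entropy method."""
    rng = np.random.default_rng(seed)
    dim = state.shape[0]
    mean = np.zeros((horizon, dim))
    std = np.ones((horizon, dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        noise = rng.standard_normal((samples, horizon, dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        # Roll every candidate out autoregressively through the model.
        states = np.repeat(state[None, :], samples, axis=0)
        for t in range(horizon):
            states = toy_world_model(states, actions[:, t])
        # Cost: distance between the final imagined state and the goal.
        costs = np.linalg.norm(states - goal, axis=1)
        # Refit the Gaussian to the lowest-cost (elite) sequences.
        elite = actions[np.argsort(costs)[:elites]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


state = np.zeros(2)
goal = np.array([2.0, -1.5])
plan = cem_plan(state, goal)
print("first action to execute:", plan[0])
```

In a receding-horizon setup, only the first action of the returned plan would be executed before replanning from the newly observed state.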
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Social Robot Interaction and HRI
