MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose

Sirine Bhouri; Lan Wei; Jian-Qing Zheng; Dandan Zhang

arXiv:2602.19348 · cs.CV · February 24, 2026

TL;DR

MultiDiffSense is a diffusion-based model that generates multi-modal visuo-tactile images conditioned on object shape and contact pose, improving data efficiency and controllability for robotic tactile sensing.

Contribution

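The paper introduces a single diffusion model that synthesises images for multiple vision-based tactile sensors, conditioned on CAD-derived depth maps and on prompts encoding sensor type and contact pose, so that aligned multi-modal training data can be generated at scale.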

Findings

01

Outperforms a Pix2Pix cGAN baseline in SSIM on all three sensors: +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip).

02

Mixing synthetic with real data reduces the amount of real data needed for downstream 3-DoF pose estimation.

03

Enables controllable, physically consistent multi-modal image synthesis.
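
As a rough illustration of the second finding, the sketch below builds a training set with a chosen share of synthetic samples. It is not the authors' pipeline: the function name `mix_datasets`, its parameters, and the tuple-based sample format are all assumptions made for this example.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction=0.5, total=None, seed=0):
    """Return a shuffled training list with the requested share of synthetic samples.

    Hypothetical helper: `real` and `synthetic` are lists of samples in any
    format; `total` defaults to the number of real samples available.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    if total is None:
        total = len(real)
    n_synthetic = round(total * synthetic_fraction)
    n_real = total - n_synthetic
    if n_synthetic > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples for the requested mix")
    # Draw without replacement from each pool, then shuffle the union.
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed


# Placeholder samples tagged by origin so the resulting mix can be inspected.
real_samples = [("real", i) for i in range(100)]
synthetic_samples = [("synthetic", i) for i in range(100)]
train_set = mix_datasets(real_samples, synthetic_samples, synthetic_fraction=0.5)
```

With `synthetic_fraction=0.5` and the default `total`, half of the 100 training samples come from the generator, so only 50 real captures are consumed.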

Abstract

Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the…
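
To make the "structured prompt" idea concrete, here is a minimal sketch of how sensor type and a 4-DoF contact pose could be serialised into a text condition. The abstract does not give the prompt template or name the four pose axes, so the field names (`x_mm`, `y_mm`, `depth_mm`, `yaw_deg`), the separator format, and the function `build_prompt` are all assumptions for illustration only.

```python
from dataclasses import dataclass

# Sensor names as listed in the abstract.
SENSORS = ("ViTac", "TacTip", "ViTacTip")


@dataclass(frozen=True)
class ContactPose:
    # Hypothetical 4-DoF parameterisation; the source does not specify the axes.
    x_mm: float
    y_mm: float
    depth_mm: float
    yaw_deg: float


def build_prompt(sensor: str, pose: ContactPose) -> str:
    """Serialise sensor type and contact pose into one text prompt (assumed format)."""
    if sensor not in SENSORS:
        raise ValueError(f"unknown sensor {sensor!r}; expected one of {SENSORS}")
    return (
        f"sensor: {sensor}; "
        f"x: {pose.x_mm:.2f} mm; y: {pose.y_mm:.2f} mm; "
        f"depth: {pose.depth_mm:.2f} mm; yaw: {pose.yaw_deg:.2f} deg"
    )


prompt = build_prompt("TacTip", ContactPose(1.0, -2.0, 0.5, 15.0))
```

In a dual-conditioning setup like the one described, a prompt of this kind would be paired with the pose-aligned depth map for the same contact, so the text carries the sensor identity and pose while the depth map carries the object geometry.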

Code & Models

Model (Hugging Face): sirine16/MultiDiffSense

Taxonomy

Topics: Advanced Sensor and Energy Harvesting Materials · Robot Manipulation and Learning · Tactile and Sensory Interactions