VLH: Vision-Language-Haptics Foundation Model

Luis Francisco Moreno Fuentes; Muhammad Haris Khan; Miguel Altamirano Cabrera; Valerii Serpiva; Dmitri Iarchuk; Yara Mahmoud; Issatay Tokmurziyev; Dzmitry Tsetserukou

arXiv:2508.01361 · cs.RO · August 5, 2025

TL;DR

VLH introduces a unified vision-language-haptic foundation model for aerial robotics and virtual reality, enabling real-time, context-aware tactile feedback driven by visual and language understanding.

Contribution

The paper presents VLH, a novel multimodal foundation model integrating perception, language, and haptics for immersive human-robot interaction in aerial and virtual environments.

Findings

01. Achieved a 56.7% success rate in target acquisition during human-robot interaction.

02. Demonstrated 100% accuracy in texture discrimination tasks.

03. Generalized to novel tasks, with up to 70% performance in visual scenarios.

Abstract

We present VLH, a novel Visual-Language-Haptic Foundation Model that unifies perception, language, and tactile feedback in aerial robotics and virtual reality. Unlike prior work that treats haptics as a secondary, reactive channel, VLH synthesizes mid-air force and vibration cues as a direct consequence of contextual visual understanding and natural language commands. Our platform comprises an 8-inch quadcopter equipped with dual inverse five-bar linkage arrays for localized haptic actuation, an egocentric VR camera, and an exocentric top-down view. Visual inputs and language instructions are processed by a fine-tuned OpenVLA backbone (adapted via LoRA on a bespoke dataset of 450 multimodal scenarios) to output a 7-dimensional action vector (Vx, Vy, Vz, Hx, Hy, Hz, Hv). INT8 quantization and a high-performance server ensure real-time operation at 4-5 Hz. In human-robot interaction…
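
The 7-dimensional output and the 4-5 Hz rate mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The names VLHAction, split_action, and step_budget_ms are hypothetical, and the reading of (Vx, Vy, Vz) as a velocity command, (Hx, Hy, Hz) as a haptic force cue, and Hv as a vibration cue is an assumption drawn only from the abstract's mention of force and vibration cues. The sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class VLHAction:
    """Hypothetical container for the 7-D action vector named in the abstract."""
    velocity: Tuple[float, float, float]  # (Vx, Vy, Vz); assumed velocity command
    haptic: Tuple[float, float, float]    # (Hx, Hy, Hz); assumed haptic force cue
    vibration: float                      # Hv; assumed vibration cue


def split_action(vec: Sequence[float]) -> VLHAction:
    """Split a flat 7-element vector, ordered as in the abstract, into named parts."""
    if len(vec) != 7:
        raise ValueError(f"expected 7 values, got {len(vec)}")
    vx, vy, vz, hx, hy, hz, hv = (float(v) for v in vec)
    return VLHAction((vx, vy, vz), (hx, hy, hz), hv)


def step_budget_ms(rate_hz: float) -> float:
    """Time available per control step, in milliseconds, at a given update rate."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


# Illustrative values only; they do not come from the paper.
action = split_action([0.1, 0.0, -0.05, 0.0, 0.3, 0.0, 0.8])
print(action.velocity, action.haptic, action.vibration)

# A 4-5 Hz loop leaves roughly 200-250 ms per inference-plus-actuation step.
print(step_budget_ms(4), step_budget_ms(5))
```

Keeping the motion and haptic components in separate named fields makes it harder to pass a haptic value to a flight controller by mistake, which is one reason to avoid indexing into the raw vector directly.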

Taxonomy

Topics: Multimodal Machine Learning Applications · Tactile and Sensory Interactions · Hand Gesture Recognition Systems