HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
Guimin Hu, Daniel Hershcovich, Hasti Seifi

TL;DR
HapticLLaMA is a multimodal language model that translates vibration signals into natural language descriptions, advancing haptic captioning for virtual reality, accessibility, and rehabilitation, with reinforcement learning from human feedback improving alignment with human ratings.
Contribution
The paper introduces HapticLLaMA, presented as the first large language model designed for haptic captioning. It combines two haptic tokenizers (frequency-based and EnCodec-based) with LoRA-based supervised fine-tuning and reinforcement learning from human feedback.
Findings
HapticLLaMA achieves a METEOR score of 59.98 and a BLEU-4 score of 32.06.
Over 61% of generated captions received human ratings above 3.5 on a 7-point scale.
RLHF improved the overall human rating distribution by 10%.
Abstract
Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation,…
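To make the idea of a frequency-based haptic tokenizer concrete, here is a minimal sketch of one way a vibration signal could be turned into a sequence of discrete units. It is an illustration under stated assumptions, not the paper's tokenizer: the framing scheme, the use of the dominant DFT frequency per frame, the function names dominant_frequency and frequency_tokens, the parameters frame_size, num_bins, and max_freq, and the token format <freq_k> are all hypothetical.

```python
import cmath
import math


def dominant_frequency(frame, sample_rate):
    """Return the frequency in Hz of the strongest non-DC DFT component.

    Uses a naive O(n^2) DFT so the sketch needs only the standard library.
    """
    n = len(frame)
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n


def frequency_tokens(signal, sample_rate, frame_size=64, num_bins=16, max_freq=500.0):
    """Map a signal to one discrete token per non-overlapping frame.

    Hypothetical scheme: each frame's dominant frequency is quantized
    into one of `num_bins` equal-width bins over [0, max_freq]; anything
    above max_freq falls into the last bin. Trailing samples that do not
    fill a whole frame are dropped.
    """
    tokens = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        freq = dominant_frequency(frame, sample_rate)
        bin_index = min(int(freq / max_freq * num_bins), num_bins - 1)
        tokens.append(f"<freq_{bin_index}>")
    return tokens
```

In a setup like the one the abstract describes, tokens of this kind would be added to the language model's vocabulary so that a haptic signal can be fed to the model as an ordinary token sequence. For example, a 50 Hz sine wave sampled at 1000 Hz and tokenized with frame_size=100 yields one token per 100-sample frame, all falling in the same low-frequency bin.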
Taxonomy
Topics: Speech and dialogue systems · Subtitles and Audiovisual Media · Hand Gesture Recognition Systems
