By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting
Hyungjun Yoon, Biniyam Aschalew Tolera, Taesik Gong, Kimin Lee, Sung-Ju Lee

TL;DR
This paper introduces a visual prompting method for multimodal large language models to improve performance and efficiency in sensory data tasks by visualizing sensor data and automating visualization creation.
Contribution
It presents a novel visual prompting approach and an automated visualization generator that improve how multimodal LLMs (MLLMs) handle sensor data, without requiring prior task-specific knowledge.
Findings
An average of 10% higher accuracy than text-based prompts
Token costs reduced by a factor of 15.8
Evaluated on nine sensory tasks involving four sensing modalities
Abstract
Large language models (LLMs) have demonstrated exceptional abilities across various domains. However, utilizing LLMs for ubiquitous sensing applications remains challenging as existing text-prompt methods show significant performance degradation when handling long sensor data sequences. We propose a visual prompting approach for sensor data using multimodal LLMs (MLLMs). We design a visual prompt that directs MLLMs to utilize visualized sensor data alongside the target sensory task descriptions. Additionally, we introduce a visualization generator that automates the creation of optimal visualizations tailored to a given sensory task, eliminating the need for prior task-specific knowledge. We evaluated our approach on nine sensory tasks involving four sensing modalities, achieving an average of 10% higher accuracy than text-based prompts and reducing token costs by 15.8 times. Our…
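The contrast between text prompts and visual prompts described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt dictionary, the line-plot chart type, and the example task string are all hypothetical, and the plot is drawn as SVG using only the standard library rather than with whatever plotting tool the paper uses. The sketch only shows why a text prompt grows with the sequence length while a plot keeps a fixed canvas size; actual token costs depend on how a given MLLM tokenizes images, which is not modeled here.

```python
import math


def series_to_text_prompt(values, task):
    """Text baseline: every reading is serialised, so the prompt grows with the sequence."""
    readings = ", ".join(f"{v:.3f}" for v in values)
    return f"{task}\nSensor readings: {readings}"


def series_to_svg(values, width=400, height=200, pad=10):
    """Draw a 1-D series as a line plot on a fixed-size canvas (assumed chart type)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
    last = max(len(values) - 1, 1)
    points = []
    for i, v in enumerate(values):
        x = pad + (width - 2 * pad) * i / last
        y = height - pad - (height - 2 * pad) * (v - lo) / span  # SVG y axis points down
        points.append(f"{x:.1f},{y:.1f}")
    pts = " ".join(points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polyline fill="none" stroke="black" points="{pts}"/></svg>'
    )


def build_visual_prompt(values, task):
    """Hypothetical prompt structure: task description plus the visualised data."""
    return {
        "text": f"{task}\nThe attached plot shows the sensor readings.",
        "image_svg": series_to_svg(values),
    }


if __name__ == "__main__":
    # Synthetic signal for illustration only; not data from the paper.
    signal = [math.sin(2 * math.pi * t / 50) for t in range(1000)]
    task = "Classify the activity captured by this accelerometer axis."  # hypothetical task
    print("text prompt characters:", len(series_to_text_prompt(signal, task)))
    print("visual prompt text characters:", len(build_visual_prompt(signal, task)["text"]))
```

In practice the SVG would be rasterised to an image before being sent to an MLLM, so the image has the same pixel dimensions whether the signal has a hundred readings or a hundred thousand.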
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
