By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting

Hyungjun Yoon; Biniyam Aschalew Tolera; Taesik Gong; Kimin Lee; Sung-Ju Lee

arXiv:2407.10385·cs.CL·October 1, 2024

TL;DR

This paper introduces a visual prompting method in which multimodal large language models receive sensor data as visualizations rather than long text sequences, together with a generator that automatically creates a suitable visualization for each sensory task, improving accuracy while reducing token costs.

Contribution

The paper presents a visual prompting approach and an automated visualization generator that improve how MLLMs handle sensor data, without requiring task-specific prior knowledge.

Findings

01 An average of 10% higher accuracy than text-based prompts

02 Token costs reduced by 15.8 times

03 Effective across nine sensory tasks involving four sensing modalities

Abstract

Large language models (LLMs) have demonstrated exceptional abilities across various domains. However, utilizing LLMs for ubiquitous sensing applications remains challenging as existing text-prompt methods show significant performance degradation when handling long sensor data sequences. We propose a visual prompting approach for sensor data using multimodal LLMs (MLLMs). We design a visual prompt that directs MLLMs to utilize visualized sensor data alongside the target sensory task descriptions. Additionally, we introduce a visualization generator that automates the creation of optimal visualizations tailored to a given sensory task, eliminating the need for prior task-specific knowledge. We evaluated our approach on nine sensory tasks involving four sensing modalities, achieving an average of 10% higher accuracy than text-based prompts and reducing token costs by 15.8 times. Our…
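The contrast the abstract draws between text prompts and visual prompts can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed 256x128 grayscale line plot, and the payload dictionary are all assumptions made for illustration, the payload does not follow any particular MLLM API's schema, and the sine wave is a synthetic stand-in for real sensor data. The sketch uses only the Python standard library to rasterize a signal into a PNG and to build the equivalent text serialization.

```python
import base64
import math
import struct
import zlib


def render_line_plot(samples, width=256, height=128):
    """Rasterize a 1-D signal into rows of grayscale pixels (255 = white, 0 = trace)."""
    lo, hi = min(samples), max(samples)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat signal
    rows = [bytearray([255] * width) for _ in range(height)]
    prev_y = None
    for x in range(width):
        # Pick one sample per pixel column (simple index-based resampling).
        idx = x * (len(samples) - 1) // max(width - 1, 1)
        y = round((hi - samples[idx]) / span * (height - 1))
        # Fill the vertical gap to the previous column so the trace stays connected.
        y0, y1 = (y, y) if prev_y is None else sorted((prev_y, y))
        for yy in range(y0, y1 + 1):
            rows[yy][x] = 0
        prev_y = y
    return rows


def encode_png(rows):
    """Encode grayscale pixel rows as an 8-bit PNG."""
    height, width = len(rows), len(rows[0])

    def chunk(tag, data):
        body = tag + data
        crc = zlib.crc32(body) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)

    # IHDR: width, height, bit depth 8, color type 0 (grayscale), then defaults.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b""))


def text_prompt(samples):
    """Text-prompt baseline: serialize every reading as a number."""
    return ", ".join(f"{v:.3f}" for v in samples)


def visual_prompt_payload(samples, task_description):
    """Hypothetical payload: task description plus the plot as base64-encoded PNG."""
    png = encode_png(render_line_plot(samples))
    return {"text": task_description,
            "image_base64": base64.b64encode(png).decode("ascii")}


# Synthetic stand-in for a long sensor recording (not data from the paper).
signal = [math.sin(2 * math.pi * 5 * t / 3000) for t in range(3000)]
payload = visual_prompt_payload(signal, "Classify the activity shown in the plot.")
print("text prompt characters:", len(text_prompt(signal)))
print("image size (pixels): 256 x 128, independent of sequence length")
```

In this sketch the text serialization grows linearly with the number of readings, while the image dimensions stay fixed. How that translates into token counts depends on the specific model's tokenizer and image pricing, which the sketch does not model.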

Code & Models

Repositories

diamond264/ByMyEyes (Official)

Videos

By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques