Evaluating Visual Prompts with Eye-Tracking Data for MLLM-Based Human Activity Recognition

Jae Young Choi; Seon Gyeom Kim; Hyungjun Yoon; Taeckyung Lee; Donggun Lee; Jaeryung Chung; Jihyung Kil; Ryan Rossi; Sung-Ju Lee; Tak Yeon Lee

arXiv:2604.09585·cs.HC·April 14, 2026

TL;DR

This paper explores transforming eye-tracking sensor data into visualization images to improve human activity recognition with multimodal LLMs, offering a scalable and token-efficient approach.

Contribution

The paper introduces a visual prompting strategy that converts eye-tracking data into images for MLLMs, and evaluates it systematically across datasets and visualization types.

Findings

01

Visual prompting enhances token efficiency for eye-tracking data.

02

Different visualization types impact recognition performance.

03

The approach enables MLLMs to reason over high-frequency sensor signals.

Abstract

Large Language Models (LLMs) have emerged as foundation models for IoT applications such as human activity recognition (HAR). However, feeding high-frequency, multi-dimensional sensor data, such as eye-tracking data, directly to an LLM leads to information loss and high token costs. To mitigate this, we use eye-tracking data to investigate a visual prompting strategy that transforms sensor signals into data visualization images as input to multimodal LLMs (MLLMs). We conducted a systematic evaluation of MLLM-based HAR across three public eye-tracking datasets using three visualization types (timeline, heatmap, and scanpath) under varying temporal window sizes. Our findings suggest that visual prompting provides a token-efficient and scalable representation for eye-tracking data, highlighting its potential to enable MLLMs to effectively reason over high-frequency sensor signals in IoT…
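As a rough illustration of the windowing and heatmap ideas mentioned in the abstract, the sketch below splits gaze samples into fixed-length temporal windows and bins each window into a coarse 2D count grid, which a plotting library could then render as a heatmap image. This is not the authors' pipeline: the sample layout `(timestamp_s, x, y)` with normalized coordinates, the function names `split_windows` and `gaze_heatmap`, and the `bins` parameter are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed sample layout: (timestamp in seconds, x, y), with x and y normalized to [0, 1].
GazeSample = Tuple[float, float, float]


def split_windows(samples: List[GazeSample], window_s: float) -> List[List[GazeSample]]:
    """Group samples into consecutive, non-overlapping windows of window_s seconds.

    Windows are measured from the first sample's timestamp; empty windows are skipped.
    """
    if not samples:
        return []
    start = samples[0][0]
    windows: Dict[int, List[GazeSample]] = {}
    for t, x, y in samples:
        idx = int((t - start) // window_s)
        windows.setdefault(idx, []).append((t, x, y))
    return [windows[i] for i in sorted(windows)]


def gaze_heatmap(window: List[GazeSample], bins: int = 4) -> List[List[int]]:
    """Count gaze samples per cell of a bins x bins grid (rows index y, columns index x)."""
    grid = [[0] * bins for _ in range(bins)]
    for _, x, y in window:
        # Clamp so that out-of-range or boundary values (e.g. x == 1.0) land in an edge cell.
        col = max(0, min(int(x * bins), bins - 1))
        row = max(0, min(int(y * bins), bins - 1))
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    # Hypothetical gaze samples, for demonstration only.
    demo = [(0.0, 0.1, 0.1), (0.5, 0.1, 0.2), (1.2, 0.9, 0.9), (2.5, 1.0, 0.0)]
    for i, w in enumerate(split_windows(demo, window_s=1.0)):
        print(i, gaze_heatmap(w, bins=4))
```

Each window yields one fixed-size grid regardless of how many samples it contains, which is one way to see why an image-style representation can stay compact as the sampling rate grows; the actual rendering and prompt construction are left out of this sketch.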
