SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, and Yong Man Ro

arXiv:2408.12114 · cs.CV · October 14, 2024

TL;DR

SPARK introduces a comprehensive benchmark for evaluating large-scale vision-language models on multi-vision sensor perception and reasoning, highlighting current models' deficiencies in understanding diverse sensor data and physical environment context.

Contribution

The paper presents the SPARK benchmark with 6,248 test samples to evaluate LVLMs' multi-vision sensor perception and reasoning capabilities, addressing the gap in physical sensor information understanding.

Findings

01

Most models show deficiencies in multi-vision sensory reasoning.

02

Evaluation across different sensor types reveals varied model performance.

03

The benchmark enables systematic assessment of sensor-related understanding in LVLMs.
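
As a rough illustration of how a per-sensor breakdown like the one in finding 02 could be computed, the sketch below scores multiple-choice predictions grouped by sensor type. The record fields (`sensor`, `prediction`, `answer`) and the toy records are hypothetical and are not taken from the SPARK dataset or its official evaluation code.

```python
from collections import defaultdict


def per_sensor_accuracy(records):
    """Return {sensor: accuracy} for an iterable of dicts.

    Each dict is assumed (hypothetically) to carry the keys
    'sensor', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["sensor"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["sensor"]] += 1
    return {sensor: correct[sensor] / total[sensor] for sensor in total}


# Made-up toy records, for illustration only.
records = [
    {"sensor": "thermal", "prediction": "A", "answer": "A"},
    {"sensor": "thermal", "prediction": "B", "answer": "C"},
    {"sensor": "depth", "prediction": "d", "answer": "D"},
    {"sensor": "xray", "prediction": "B", "answer": "A"},
]
scores = per_sensor_accuracy(records)
for sensor, acc in scores.items():
    print(f"{sensor}: {acc:.2f}")
```

Reporting accuracy per sensor rather than as a single aggregate is what makes differences between, say, thermal and depth inputs visible.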

Abstract

Large-scale Vision-Language Models (LVLMs) have advanced significantly by aligning the text modality with vision inputs, and have made remarkable progress on computer vision tasks. There are also efforts to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs treat images from multi-vision sensors as if they were in the same RGB domain, without considering the sensors' physical characteristics. They fail to properly convey the fundamental multi-vision sensor information in the dataset and the corresponding contextual knowledge. Consequently, the information from the actual physical environment is not correctly aligned with the text, making it difficult to answer complex sensor-related questions that consider the physical environment. In…

Code & Models

Repositories

top-yun/spark
PyTorch · Official

Datasets

topyun/SPARK
dataset · 27 dl

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications