SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, and Yong Man Ro

TL;DR
SPARK introduces a comprehensive benchmark for evaluating large-scale vision-language models on multi-vision sensor perception and reasoning, highlighting current models' deficiencies in understanding diverse sensor data and physical environment context.
Contribution
The paper presents the SPARK benchmark, with 6,248 test samples, to evaluate the multi-vision sensor perception and reasoning capabilities of large-scale vision-language models (LVLMs). It targets a gap in how these models understand physical sensor information.
Findings
Most models show deficiencies in multi-vision sensory reasoning.
Evaluation across different sensor types reveals varied model performance.
The benchmark enables systematic assessment of sensor-related understanding in LVLMs.
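As a rough illustration of what a per-sensor assessment involves, the sketch below groups multiple-choice results by sensor type and computes accuracy for each group. The record fields (`sensor`, `prediction`, `answer`) and the toy data are assumptions made for this example; they are not SPARK's actual data schema or evaluation code, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_sensor(records):
    """Return {sensor: accuracy} for a list of result records.

    Each record is a dict with hypothetical keys:
    'sensor' (e.g. "thermal"), 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["sensor"]] += 1
        # Compare answer letters case-insensitively, ignoring stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["sensor"]] += 1
    return {sensor: correct[sensor] / total[sensor] for sensor in total}


# Toy records for illustration only (not benchmark data).
records = [
    {"sensor": "thermal", "prediction": "A", "answer": "A"},
    {"sensor": "thermal", "prediction": "B", "answer": "C"},
    {"sensor": "depth", "prediction": "d", "answer": "D"},
    {"sensor": "xray", "prediction": "B", "answer": "A"},
]

scores = accuracy_by_sensor(records)
for sensor, acc in scores.items():
    print(f"{sensor}: {acc:.2f}")
```

Reporting accuracy per sensor rather than a single pooled number is what lets differences between, say, thermal and depth inputs show up at all.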
Abstract
Large-scale Vision-Language Models (LVLMs) have significantly advanced with text-aligned vision inputs. They have made remarkable progress in computer vision tasks by aligning text modality with vision inputs. There are also endeavors to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs view images taken from multi-vision sensors as if they were in the same RGB domain without considering the physical characteristics of multi-vision sensors. They fail to convey the fundamental multi-vision sensor information from the dataset and the corresponding contextual knowledge properly. Consequently, alignment between the information from the actual physical environment and the text is not achieved correctly, making it difficult to answer complex sensor-related questions that consider the physical environment. In…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications
