Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking
Sangyun Chung, Youngjoon Yu, Se Yeon Kim, Youngchae Chee, and Yong Man Ro

TL;DR
This paper introduces a cost-efficient method and a new benchmark to improve and evaluate vision-language models' understanding of diverse sensor data beyond RGB images, without extensive retraining.
Contribution
The paper proposes Sensor-Aware Attributes Fine-Tuning (SAFT) with Diverse Negative Attributes (DNA) optimization and introduces VS-TDX, a benchmark for evaluating sensor-specific understanding in VLMs.
Findings
SAFT improves non-RGB sensor understanding with minimal data
VS-TDX effectively evaluates sensor-specific VLM performance
The method outperforms existing approaches in resource-constrained scenarios
Abstract
Large-scale Vision-Language Models (VLMs) have achieved notable progress in aligning visual inputs with text. However, their ability to deeply understand the unique physical properties of non-RGB vision sensor images remains limited. In this paper, we revisit and analyze these limitations and introduce a novel, cost-efficient paradigm that significantly advances sensor image understanding without requiring extensive training data or any modifications to the existing VLM architectures. Specifically, we propose Sensor-Aware Attributes Fine-Tuning (SAFT) with the Diverse Negative Attributes (DNA) optimization, which leverages minimal sensor-specific data to enable robust learning of non-RGB characteristics and overcome RGB-centric biases inherent in current VLMs. In addition, we present VS-TDX, the first comprehensive, public benchmark designed to rigorously evaluate VLMs' sensor-specific…
Taxonomy
Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
