Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking

Sangyun Chung, Youngjoon Yu, Se Yeon Kim, Youngchae Chee, and Yong Man Ro

arXiv:2412.20750 · cs.CV · August 4, 2025

TL;DR

This paper introduces a cost-efficient method and a new benchmark to improve and evaluate vision-language models' understanding of diverse sensor data beyond RGB images, without extensive retraining.

Contribution

The paper proposes Sensor-Aware Attributes Fine-Tuning (SAFT) with Diverse Negative Attributes (DNA) optimization and introduces VS-TDX, a public benchmark for evaluating sensor-specific understanding in VLMs.

Findings

01. SAFT improves non-RGB sensor understanding with minimal data
02. VS-TDX effectively evaluates sensor-specific VLM performance
03. Method outperforms existing approaches in resource-constrained scenarios

Abstract

Large-scale Vision-Language Models (VLMs) have achieved notable progress in aligning visual inputs with text. However, their ability to deeply understand the unique physical properties of non-RGB vision sensor images remains limited. In this paper, we revisit and analyze these limitations and introduce a novel, cost-efficient paradigm that significantly advances sensor image understanding without requiring extensive training data or any modifications to the existing VLM architectures. Specifically, we propose Sensor-Aware Attributes Fine-Tuning (SAFT) with the Diverse Negative Attributes (DNA) optimization, which leverages minimal sensor-specific data to enable robust learning of non-RGB characteristics and overcome RGB-centric biases inherent in current VLMs. In addition, we present VS-TDX, the first comprehensive, public benchmark designed to rigorously evaluate VLMs' sensor-specific…

Code & Models

Repositories

top-yun/ms-pr
PyTorch · Official

Datasets

topyun/VS-TDX
dataset · 169 downloads

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques