PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

Jinhe Bi; Aniri; Yifan Wang; Danqi Yan; Wenke Huang; Zengjie Jin; Xiaowen Ma; Sikuan Yan; Artur Hecker; Mang Ye; Xun Xiao; Hinrich Schuetze; Volker Tresp; Yunpu Ma

arXiv:2502.12119 · cs.CV · January 14, 2026

TL;DR

PRISM is a training-free, efficient visual instruction data selection method that addresses redundancy in multimodal datasets by modeling intrinsic visual semantics, significantly reducing tuning time and improving performance.

Contribution

PRISM introduces the first training-free framework for visual instruction selection that models intrinsic visual semantics to remove background influence, enhancing efficiency and effectiveness.

Findings

01 Reduces the combined time for data selection and tuning to 30% of that required by traditional methods.

02 Surpasses full-dataset fine-tuning performance on multiple benchmarks.

03 Achieves a 101.7% relative improvement over the baseline.

Abstract

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a…

Taxonomy

Topics: Educational Technology and Assessment · Speech and dialogue systems · Wireless Sensor Networks and IoT