EVIAN: Towards Explainable Visual Instruction-tuning Data Auditing
Zimu Jia, Mingjie Xu, Andrew Estornell, Jiaheng Wei

TL;DR
This paper introduces EVIAN, a framework for explainable auditing of visual instruction-tuning data, and shows that LVLMs trained on smaller, high-quality curated datasets can outperform those trained on larger, noisier ones.
Contribution
The paper presents a "Decomposition-then-Evaluation" auditing paradigm and a 300K-sample benchmark of systematically injected defects, enabling fine-grained assessment of data quality.
Findings
Models trained on high-quality curated datasets outperform those trained on larger, noisier datasets.
Decomposing the auditing task into subtasks makes it more robust and accurate.
Logical coherence is the most critical factor in data quality evaluation.
Abstract
The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective…
