VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels

Xiwei Xuan; Xiaoqi Wang; Wenbin He; Jorge Piazentin Ono; Liang Gou; Kwan-Liu Ma; Liu Ren

arXiv:2507.09008·cs.CV·July 15, 2025

TL;DR

VISTA is a visual analytics framework designed to improve the quality of labels generated by foundation models for open-vocabulary image segmentation, combining multi-phased validation and human expertise.

Contribution

The paper introduces a comprehensive visual analytics approach for validating and correcting labels generated by foundation models (FMs), addressing the lack of quality assessment for large-scale auto-labeled datasets.

Findings

01. VISTA effectively identifies issues in FM-generated labels.

02. Human-in-the-loop validation improves label quality.

03. The framework's effectiveness is demonstrated on benchmark datasets.

Abstract

Advances in multi-modal foundation models (FMs) (e.g., CLIP and LLaVA) have facilitated the auto-labeling of large-scale datasets, enhancing model performance in challenging downstream tasks such as open-vocabulary object detection and segmentation. However, the quality of FM-generated labels is less studied, as existing approaches focus more on data quantity than on quality. This is because validating large volumes of data without ground truth presents a considerable challenge in practice. Existing methods typically either rely on limited metrics to identify problematic data, and so lack a comprehensive perspective, or apply human validation to only a small fraction of the data, and so fail to address the full spectrum of potential issues. To overcome these challenges, we introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. Targeting the…
