Clean Evaluations on Contaminated Visual Language Models

Hongyuan Lu, Shujie Miao, and Wai Lam

arXiv:2410.07030 · cs.CV · October 10, 2024

TL;DR

This paper introduces a novel data augmentation approach, BGR augmentation, to enable clean evaluation of visual language models by reducing data contamination effects, and presents a new benchmark for this purpose.

Contribution

It proposes BGR augmentation as an effective method for clean evaluation of VLMs and creates a new benchmark dataset for this task.

Findings

01. Traditional augmentation methods can be exploited during training.

02. BGR augmentation effectively reduces data contamination effects.

03. BGR augmentation is not suitable for training, making it ideal for evaluation.

Abstract

How to cleanly evaluate large language models (LLMs) has become an established and important research area, since it is needed to report the genuine performance of possibly contaminated LLMs. Yet how to cleanly evaluate visual language models (VLMs) remains an under-studied problem. We propose a novel approach that achieves this goal through data augmentation of the visual input. We then craft a new visual clean-evaluation benchmark with thousands of data instances. Through extensive experiments, we found that traditional visual data augmentation methods are useful, but they risk being incorporated into the training data as a workaround. We further propose BGR augmentation, which switches the colour channels of the visual input. We found that it is a simple yet effective method for reducing the effect of data contamination and, fortunately, it is also harmful to be used as…
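
The abstract describes BGR augmentation only as switching the colour channel of the visual input. A minimal sketch of that idea, assuming it means reversing each pixel's channel order from (R, G, B) to (B, G, R), might look like the following. The function name `bgr_augment` and the nested-list image representation are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical representation: an image is a list of rows of (R, G, B) pixels.
Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def bgr_augment(image: Image) -> Image:
    """Return a copy of the image with each pixel's channels reversed.

    Assumption: "BGR augmentation" is read here as (R, G, B) -> (B, G, R).
    """
    return [[(b, g, r) for (r, g, b) in row] for row in image]


# A 1x2 toy image: one pure-red pixel and one pure-blue pixel.
toy = [[(255, 0, 0), (0, 0, 255)]]
swapped = bgr_augment(toy)

# Reversing the channels twice restores the original image.
restored = bgr_augment(swapped)
```

In this toy case the red pixel becomes blue and the blue pixel becomes red, while the image layout is unchanged. For an image stored as a NumPy array of shape (H, W, 3), the same reversal can be written as `image[..., ::-1]`.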

Taxonomy

Topics: Digital Media Forensic Detection · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics