Clean Evaluations on Contaminated Visual Language Models
Hongyuan Lu, Shujie Miao, and Wai Lam

TL;DR
This paper introduces a novel data augmentation approach, BGR augmentation, to enable clean evaluation of visual language models by reducing data contamination effects, and presents a new benchmark for this purpose.
Contribution
It proposes BGR augmentation as an effective method for clean evaluation of VLMs and creates a new benchmark dataset for this task.
Findings
Traditional visual augmentation methods are useful for clean evaluation, but the augmented data can itself be added to training sets as a workaround.
BGR augmentation effectively reduces data contamination effects.
BGR augmentation is not suitable for training, making it ideal for evaluation.
Abstract
How to evaluate large language models (LLMs) cleanly has been established as an important research area for genuinely reporting the performance of possibly contaminated LLMs. Yet how to cleanly evaluate visual language models (VLMs) is an under-studied problem. We propose a novel approach to achieve this goal through data augmentation methods applied to the visual input. We then craft a new visual clean evaluation benchmark with thousands of data instances. Through extensive experiments, we found that traditional visual data augmentation methods are useful, but they are at risk of being used as part of the training data as a workaround. We further propose using BGR augmentation to switch the colour channels of the visual input. We found that it is a simple yet effective method for reducing the effect of data contamination and, fortunately, it is also harmful to be used as…
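The abstract describes BGR augmentation only as switching the colour channels of the visual input. The following is a minimal sketch of that idea, assuming it means reversing the channel order of an RGB image so that red and blue are swapped. The function name bgr_augment and the use of NumPy arrays are illustrative choices, not taken from the paper.

```python
import numpy as np


def bgr_augment(image: np.ndarray) -> np.ndarray:
    """Return a copy of an H x W x 3 image with its channel order reversed.

    Assumption: the input is in RGB order, so the output is in BGR order
    (red and blue swapped, green unchanged). Hypothetical helper, not the
    authors' implementation.
    """
    if image.ndim != 3 or image.shape[-1] != 3:
        raise ValueError("expected an H x W x 3 array")
    # Reverse the last axis; copy so the result does not alias the input.
    return image[..., ::-1].copy()


# A tiny 1 x 2 image: one pure red pixel and one pure blue pixel.
rgb = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
bgr = bgr_augment(rgb)
```

Under this reading the operation is its own inverse: applying it twice returns the original image, and the green channel is never changed.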
Taxonomy
Topics: Digital Media Forensic Detection · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics
