TL;DR
SafetyPairs introduces a scalable method for generating counterfactual image pairs that differ only in safety-critical features, enabling better evaluation and training of models for fine-grained image safety detection.
Contribution
The paper presents a novel framework for creating safety-focused counterfactual image pairs, introduces a new safety benchmark built with it, and demonstrates improved sample efficiency in model training.
Findings
SafetyPairs effectively flips safety labels with targeted image edits.
The benchmark reveals weaknesses in current vision-language models.
Data augmentation with SafetyPairs enhances model sample efficiency.
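One way to make the benchmark finding concrete is to score a model per pair rather than per image. The sketch below is illustrative only and is not taken from the paper: the `PairPrediction` type, the `pair_metrics` function, and the metric names are hypothetical. It assumes each pair consists of an original image labeled unsafe and an edited counterfactual labeled safe.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairPrediction:
    """Model verdicts for one counterfactual pair (hypothetical structure)."""
    pred_original_unsafe: bool  # verdict on the original image (assumed ground truth: unsafe)
    pred_edited_unsafe: bool    # verdict on the edited image (assumed ground truth: safe)


def pair_metrics(preds: List[PairPrediction]) -> Dict[str, float]:
    """Return per-image accuracy and the stricter pair-level accuracy."""
    if not preds:
        raise ValueError("need at least one pair")
    n = len(preds)
    # The original counts as correct when flagged unsafe.
    original_correct = sum(p.pred_original_unsafe for p in preds)
    # The edited image counts as correct when NOT flagged unsafe.
    edited_correct = sum(not p.pred_edited_unsafe for p in preds)
    # A pair counts only if both members are classified correctly.
    both_correct = sum(
        p.pred_original_unsafe and not p.pred_edited_unsafe for p in preds
    )
    return {
        "image_accuracy": (original_correct + edited_correct) / (2 * n),
        "pair_accuracy": both_correct / n,
    }


example = [
    PairPrediction(True, False),   # both correct
    PairPrediction(True, True),    # edit missed: still flagged unsafe
    PairPrediction(False, False),  # original missed
    PairPrediction(True, False),   # both correct
]
metrics = pair_metrics(example)
```

A model that ignores the edited feature and gives both images in a pair the same verdict can still reach 50% image accuracy, but its pair accuracy is zero, which is why a pair-level score is a natural fit for counterfactual data.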
Abstract
What exactly makes a particular image unsafe? Systematically differentiating between benign and problematic images is a challenging problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to the given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that…
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper presents a promising data pipeline to perform counterfactual edits on safety features in images. The method appears to be scalable and yields realistic examples that serve as hard negatives for unsafe image detection.
- Experimental results on various VLMs of varying size, both open-source and proprietary. The generated data appears universally challenging for all models tested.
- Even linear probing with a small number of counterfactual samples (<32) can significantly improve guard
- I feel that the work has yet to realize the full potential of counterfactual probing. I would have liked to see a more in-depth experimental analysis, providing insights into questions such as:
  - What features mattered the most to ground-truth image safety (by analyzing the edited regions/objects? Even obtaining the ROIs for unsafe images would be informative),
  - What features the guard models are the most sensitive to (maybe repeat the data pipeline but only change unimportant features?),
- Counterfactual pairs for safety assessments mark a valuable contribution that unearths prevalent failure modes in safeguard models despite saturation of existing benchmarks.
- Assuming that the benchmark and code will be made publicly available, these will prove useful to the community.
- Easily builds on top of established safety taxonomies and datasets (here LlavaGuard).
- The data creation framework is easily reproducible. The overall setup is described well and relevant prompts for reproduct
# Major
- The paper lacks details on the human verification of attempted edit pairs and only states that this was conducted by the authors. While this is generally acceptable (especially in safety-related research), at least the Appendix should include some details on the specific setup of this validation, the measures taken to ensure no confounding or bias by the authors, and the usual demographic statistics on the involved annotators.
- Creating synthetic counterfactuals through gener
- The idea that SafetyPairs contains pairs of images that differ only in safety-related details is interesting, since it helps show clearly whether the VLMs understand the safety-related features.
- New dataset.
- Sheds light on the weaknesses of the current VLMs.
- In the data construction, unsafe images are real-world samples, while safe images are synthetically generated through editing. This introduces a potential confounding variable: a classifier trained on such data might learn to distinguish between real and synthetic images rather than between unsafe and safe content. For example, in Section 4.3, the authors train linear probe models. Could the authors provide an analysis that disentangles these two effects and demonstrates that the model is not just learning a
