SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation

Alec Helbling; Shruti Palaskar; Kundan Krishna; Polo Chau; Leon Gatys; Joseph Yitan Cheng

arXiv:2510.21120·cs.CV·October 27, 2025


3 Reviews

TL;DR

SafetyPairs introduces a scalable method for generating counterfactual image pairs that differ only in safety-critical features, enabling better evaluation and training of models for fine-grained image safety detection.

Contribution

The paper presents a novel framework for creating safety-focused counterfactual image pairs, introduces a new safety benchmark, and demonstrates improved model training efficiency.

Findings

01

SafetyPairs effectively flips safety labels with targeted image edits.

02

The benchmark reveals weaknesses in current vision-language models.

03

Data augmentation with SafetyPairs enhances model sample efficiency.

Abstract

What exactly makes a particular image unsafe? Systematically differentiating between benign and problematic images is a challenging problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to the given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that…
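As a rough illustration of how counterfactual pairs could be scored, the sketch below computes a pair-level accuracy: a pair counts as correct only when the unsafe image is flagged and its edited counterpart is not. This is a minimal sketch, not the paper's evaluation code; the `SafetyPair` class, the `pair_accuracy` function, and the toy `flagged` set standing in for a guard model are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class SafetyPair:
    """Hypothetical container: two image ids that differ only in safety-relevant features."""
    unsafe_image: str
    safe_image: str


def pair_accuracy(pairs: Sequence[SafetyPair], is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of pairs where the unsafe image is flagged and its safe counterpart is not."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1 for p in pairs if is_unsafe(p.unsafe_image) and not is_unsafe(p.safe_image)
    )
    return correct / len(pairs)


# Toy stand-in for a guard model (assumption): a fixed set of image ids it flags as unsafe.
flagged = {"a_unsafe", "b_unsafe", "b_safe"}
pairs = [SafetyPair("a_unsafe", "a_safe"), SafetyPair("b_unsafe", "b_safe")]
score = pair_accuracy(pairs, lambda image_id: image_id in flagged)
```

Scoring at the pair level penalizes a model that flags both images of a pair, which a per-image accuracy would partly reward.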

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 3

Strengths

- The paper presents a promising data pipeline to perform counterfactual edits on safety features in images. The method appears to be scalable and yields realistic examples that serve as hard negatives for unsafe image detection.
- Experimental results on various VLMs of varying size, both open-source and proprietary. The generated data appears universally challenging for all models tested.
- Even linear probing with a small number of counterfactual samples (<32) can significantly improve guard

Weaknesses

- I feel that the work has yet to realize the full potential of counterfactual probing. I would have liked to see a more in-depth experimental analysis, providing insights into questions such as
  - What features mattered the most to ground-truth image safety (by analyzing the edited regions/objects? even obtaining the ROIs for unsafe images would be informative),
  - What features the guard models are the most sensitive to (maybe repeat the data pipeline but only change unimportant features?),

Reviewer 02·Rating 6·Confidence 5

Strengths

- counterfactual pairs for safety assessments mark a valuable contribution that unearths prevalent failure modes in safeguard models despite saturation of existing benchmarks
- assuming that the benchmark and code will be made publicly available, these will prove useful to the community
- easily builds on top of established safety taxonomies and datasets (here LlavaGuard)
- the data creation framework is easily reproducible. The overall setup is described well and relevant prompts for reproduct

Weaknesses

# Major
- The paper lacks details on the human verification of attempted edit pairs and only states that this was conducted by the authors. While this is generally acceptable (especially in safety-related research), at least the Appendix should include some details on the specific setup of this validation, measures taken to ensure no confounding or bias by the authors, and the usual demographic statistics on the involved annotators.
- Creating synthetic counterfactuals through gener

Reviewer 03·Rating 2·Confidence 4

Strengths

- The idea is interesting: SAFETYPAIRS consists of pairs of images that differ only in safety-related details, which helps show clearly whether VLMs understand the safety-related features.
- New dataset.
- Sheds light on the weaknesses of current VLMs.

Weaknesses

- In the data construction, unsafe images are real-world samples, while safe images are synthetically generated through editing. This introduces a potential confounding variable: a classifier trained on such data might learn to distinguish between real and synthetic images rather than between unsafe and safe content. For example, in Section 4.3, the authors train linear probe models. Could the authors provide an analysis that disentangles these two effects and demonstrates that the model is not just learning a
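One way such a confound check could be set up is to break accuracy down by both safety label and image source, using a control set that also contains real safe images and synthetic unsafe images. The sketch below is an illustration under that assumption, not an analysis from the paper or the review; the `accuracy_by_cell` function, the record layout, and the toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record (hypothetical layout): (true label is unsafe, image is synthetic, predicted unsafe)
Record = Tuple[bool, bool, bool]


def accuracy_by_cell(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Accuracy for every (safety label, image source) combination present in the records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_unsafe, synthetic, pred_unsafe in records:
        cell = ("unsafe" if true_unsafe else "safe", "synthetic" if synthetic else "real")
        totals[cell] += 1
        hits[cell] += int(pred_unsafe == true_unsafe)
    return {cell: hits[cell] / totals[cell] for cell in totals}


# Toy probe (assumption) that predicts "unsafe" exactly when the image is real.
records = [
    (True, False, True),    # unsafe, real: correct
    (False, True, False),   # safe, synthetic: correct
    (False, False, True),   # safe, real: wrong
    (True, True, False),    # unsafe, synthetic: wrong
]
cells = accuracy_by_cell(records)
```

A probe that keys on image source rather than content would score well on the real-unsafe and synthetic-safe cells and poorly on the other two, as the toy probe does here.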
