DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation

Luciano Baresi, Davide Yi Xian Hu, Muhammad Irfan Mas'udi, Giovanni Quattrocchi

arXiv:2502.04378 · cs.CV · February 10, 2025

TL;DR

This paper introduces DILLEMA, a novel framework that combines Large Language Models and Diffusion Models to generate realistic, diverse test cases for evaluating and improving the robustness of vision neural networks.

Contribution

It presents a new method for creating high-fidelity, counterfactual test images from textual descriptions, enhancing robustness testing beyond existing augmentation techniques.

Findings

1. Generated test cases reveal model weaknesses
2. Improved model robustness through targeted retraining
3. High human agreement on image realism

Abstract

Ensuring the robustness of deep learning models requires comprehensive and diverse testing. Existing approaches, often based on simple data augmentation techniques or generative adversarial networks, are limited in producing realistic and varied test cases. To address these limitations, we present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models to generate synthetic, high-fidelity test cases. Our approach begins by translating images into detailed textual descriptions using a captioning model, allowing the language model to identify modifiable aspects of the image and generate counterfactual descriptions. These descriptions are then used to produce new test images through a text-to-image diffusion process that preserves spatial consistency and maintains the critical elements of the scene. We demonstrate…
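
The stages described in the abstract (caption the image, let a language model propose counterfactual descriptions, then render each description with a control-conditioned diffusion model) can be sketched as a small pipeline. This is a minimal illustration only, not the authors' implementation: the function names, the `TestCase` fields, and the three `toy_*` stand-ins are all hypothetical, and a real setup would replace them with an actual captioning model, LLM, and diffusion model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# Hypothetical placeholder type: a real pipeline would pass image tensors or PIL images.
Image = Any


@dataclass
class TestCase:
    """One generated test case (field names are assumptions for illustration)."""
    source_caption: str
    counterfactual_caption: str
    image: Image


def generate_test_cases(
    image: Image,
    captioner: Callable[[Image], str],
    propose_counterfactuals: Callable[[str], List[str]],
    render: Callable[[str, Image], Image],
) -> List[TestCase]:
    """Caption the image, collect counterfactual captions, render each one.

    The original image is passed to `render` as the control signal, standing in
    for the spatial conditioning that keeps the scene layout consistent.
    """
    caption = captioner(image)
    cases = []
    for new_caption in propose_counterfactuals(caption):
        new_image = render(new_caption, image)
        cases.append(TestCase(caption, new_caption, new_image))
    return cases


# Toy stand-ins so the sketch runs without any model; none of these reflect
# the models used in the paper.
def toy_captioner(image: Image) -> str:
    return "a red car on a sunny road"


def toy_counterfactuals(caption: str) -> List[str]:
    # A real LLM would decide which aspects are modifiable; here two swaps are hard-coded.
    swaps = [("sunny", "snowy"), ("red", "blue")]
    return [caption.replace(old, new) for old, new in swaps if old in caption]


def toy_render(prompt: str, control: Image) -> Image:
    # Records its inputs instead of running a diffusion model.
    return {"prompt": prompt, "control": control}


if __name__ == "__main__":
    for case in generate_test_cases("img0", toy_captioner, toy_counterfactuals, toy_render):
        print(case.source_caption, "->", case.counterfactual_caption)
```

Passing the three stages as callables keeps the sketch independent of any particular model library, so each stage can be swapped or mocked separately.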

Code & Models

Repositories

deib-polimi/dillema (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Diffusion