UNBOX: Unveiling Black-box visual models with Natural-language
Simone Carnemolla, Chiara Russo, Simone Palazzo, Quentin Bouniot, Daniela Giordano, Zeynep Akata, Matteo Pennisi, Concetto Spampinato

TL;DR
UNBOX is a novel framework that uses language and diffusion models to interpret black-box visual models without internal access, revealing learned concepts, biases, and training data characteristics.
Contribution
The paper introduces a data-free, gradient-free method that uses LLMs and text-to-image diffusion models for class-wise model dissection under black-box constraints.
Findings
UNBOX produces human-interpretable descriptors for model classes.
It performs competitively with white-box interpretability methods.
UNBOX reveals model concepts and biases without internal access.
Abstract
Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces…
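To make the idea of "activation maximization as a semantic search driven by output probabilities" concrete, here is a minimal sketch in Python. It is not the authors' implementation: this summary does not specify the actual search procedure, so the greedy loop, the function names (`semantic_search`, `toy_propose`, `toy_render`, `toy_black_box`), and the toy scoring rule are all assumptions made for illustration. The LLM, the diffusion model, and the black-box classifier are replaced by small deterministic stand-ins so that the example runs on its own.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Stand-in "image": just the tuple of words in the prompt (hypothetical).
Image = Tuple[str, ...]


def semantic_search(
    target_class: int,
    seeds: Sequence[str],
    propose: Callable[[List[str]], List[str]],  # stand-in for an LLM
    render: Callable[[str], Image],             # stand-in for a text-to-image model
    query: Callable[[Image], Sequence[float]],  # black box: image -> class probabilities
    rounds: int = 3,
    keep: int = 2,
) -> List[Tuple[str, float]]:
    """Greedy search over text descriptors using only output probabilities."""
    scored: Dict[str, float] = {}
    frontier = list(seeds)
    best: List[str] = []
    for r in range(rounds):
        # Score each new descriptor by the black box's probability for the target class.
        for d in frontier:
            if d not in scored:
                scored[d] = query(render(d))[target_class]
        # Keep the highest-scoring descriptors seen so far.
        best = sorted(scored, key=scored.get, reverse=True)[:keep]
        # Ask the proposer for refinements, except after the final round.
        if r < rounds - 1:
            frontier = propose(best)
    return [(d, scored[d]) for d in best]


# --- Toy stand-ins (assumptions, for illustration only) ---

def toy_render(prompt: str) -> Image:
    return tuple(prompt.lower().split())


def toy_black_box(image: Image) -> List[float]:
    # Pretend class 0 responds to these cues; the search never sees this rule.
    hits = sum(1 for w in image if w in {"snow", "sled", "fur"})
    p0 = (1 + hits) / (2 + hits)
    return [p0, 1.0 - p0]


def toy_propose(best: List[str]) -> List[str]:
    vocab = ["snow", "grass", "sled"]
    return [f"{d} {w}" for d in best for w in vocab if w not in d.split()]


result = semantic_search(
    target_class=0,
    seeds=["a dog", "a dog on grass"],
    propose=toy_propose,
    render=toy_render,
    query=toy_black_box,
)
print(result)
```

In this toy run the search converges on descriptors that mention both "snow" and "sled", which is the kind of cue a class-wise audit would surface as a possible background bias. No gradients, weights, or training data are touched: the only signal is the probability vector returned by `query`.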
