MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain; Alexandros Stergiou

arXiv:2508.07833 · cs.CV · April 8, 2026

TL;DR

MIMIC is a novel framework that inverts VLMs' internal encodings to improve interpretability and conceptual understanding of multimodal models.

Contribution

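To the authors' knowledge, MIMIC is the first model inversion method designed for visual interpretation of VLMs. It combines a joint VLM-based inversion with a feature alignment objective and three regularizers for spatial alignment, natural image smoothness, and semantic realism.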

Findings

01. Inverts visual concepts across a range of free-form VLM outputs of varying length.

02. Reports results on both standard visual quality metrics and semantic text-based metrics.

03. Provides both qualitative and quantitative insights into VLM internal representations.

Abstract

Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLMs' autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM…
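
The general idea of inversion with a regularizer can be illustrated with a toy example. The sketch below is not MIMIC's implementation: it assumes a random linear map as a stand-in for the VLM encoder, uses a simple squared-difference penalty as a stand-in for the natural image smoothness regularizer, and omits the spatial alignment and semantic realism terms entirely. All names (`invert`, `objective`, `smoothness_penalty`, `tv_weight`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def smoothness_penalty(img):
    """Sum of squared differences between neighbouring pixels (a smoothness prior)."""
    dh = img[1:, :] - img[:-1, :]
    dw = img[:, 1:] - img[:, :-1]
    return float((dh ** 2).sum() + (dw ** 2).sum())

def smoothness_grad(img):
    """Analytic gradient of smoothness_penalty with respect to the image."""
    grad = np.zeros_like(img)
    dh = img[1:, :] - img[:-1, :]
    dw = img[:, 1:] - img[:, :-1]
    grad[1:, :] += 2 * dh
    grad[:-1, :] -= 2 * dh
    grad[:, 1:] += 2 * dw
    grad[:, :-1] -= 2 * dw
    return grad

def objective(encoder, target, img, tv_weight):
    """Feature-matching loss plus weighted smoothness penalty."""
    residual = encoder @ img.ravel() - target
    return float(residual @ residual) + tv_weight * smoothness_penalty(img)

def invert(encoder, target, shape, steps=200, lr=0.01, tv_weight=0.1, seed=0):
    """Gradient descent on the image so that its (toy) features match the target."""
    rng = np.random.default_rng(seed)
    img = rng.normal(scale=0.1, size=shape)  # start from low-amplitude noise
    for _ in range(steps):
        residual = encoder @ img.ravel() - target
        grad_feat = (2 * encoder.T @ residual).reshape(shape)
        img -= lr * (grad_feat + tv_weight * smoothness_grad(img))
    return img

# Toy setup: a scaled random linear "encoder" (assumption, stands in for a VLM).
rng = np.random.default_rng(1)
encoder = rng.normal(size=(16, 64)) / 8.0
reference = rng.normal(size=(8, 8))
target = encoder @ reference.ravel()

# Same initial image that invert(seed=0) starts from, kept for comparison.
start = np.random.default_rng(0).normal(scale=0.1, size=(8, 8))
result = invert(encoder, target, (8, 8))
```

Comparing `objective(encoder, target, start, 0.1)` with `objective(encoder, target, result, 0.1)` shows how far the optimisation moved the image toward the target features. In a real setting the encoder would be a differentiable VLM and the gradients would come from automatic differentiation rather than the closed-form expressions used here.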

Code & Models

Repositories

anaekin/MIMIC (GitHub)