# Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study

**Authors:** Max Torop, Masih Eskandar, Nicholas Kurtansky, Jinyang Liu, Jochen Weber, Octavia Camps, Veronica Rotemberg, Jennifer Dy, Kivanc Kose

arXiv: 2508.20188 · 2025-08-29

## TL;DR

This study investigates whether grounding the embeddings of multimodal large language models (MLLMs) in quantitative skin attributes can make skin disease diagnosis more interpretable, using a content-based image retrieval case study as evidence.

## Contribution

The paper proposes grounding MLLM embeddings in quantitative skin attributes by fine-tuning the model to predict attribute values from images, with the aim of improving interpretability in skin lesion diagnosis.

## Key findings

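- The authors report evidence that MLLM embedding spaces can be grounded in quantitative skin attributes (e.g., lesion area).
- Grounding is achieved by fine-tuning the model to predict attribute values from images.
- The grounding is evaluated with an attribute-specific content-based image retrieval case study on the SLICE-3D dataset.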

## Abstract

Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes that are related to lesion appearance (e.g., lesion area) have recently been found predictive of malignancy with high accuracy. Predictions grounded as a function of such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes, through fine-tuning to predict their values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset.
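
To make the retrieval idea in the abstract concrete, the following is a minimal sketch of attribute-specific content-based image retrieval over an embedding space. It is not the authors' implementation: this summary does not state the similarity measure or the evaluation metric, so cosine similarity and a mean absolute attribute gap are assumptions, and the function names (`retrieve_top_k`, `mean_attribute_gap`) and the toy embeddings and lesion-area values are hypothetical stand-ins for real MLLM embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(query_emb, gallery_embs, k=5):
    """Return indices of the k gallery items most similar to the query.

    Similarity is cosine similarity (an assumption, not taken from the paper).
    """
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    g = l2_normalize(np.asarray(gallery_embs, dtype=float))
    sims = g @ q  # one cosine similarity per gallery item
    order = np.argsort(-sims, kind="stable")  # most similar first
    return order[:k]


def mean_attribute_gap(query_attr, gallery_attrs, retrieved_idx):
    """Mean absolute difference between the query's attribute value and the
    attribute values of the retrieved items (a hypothetical evaluation metric).
    """
    gallery_attrs = np.asarray(gallery_attrs, dtype=float)
    return float(np.mean(np.abs(gallery_attrs[retrieved_idx] - query_attr)))


# Toy data: hand-made 2-D "embeddings" and made-up lesion areas.
gallery_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
gallery_area = np.array([10.0, 11.0, 30.0, 50.0])
query_emb = np.array([1.0, 0.05])
query_area = 10.5

top = retrieve_top_k(query_emb, gallery_embs, k=2)
gap = mean_attribute_gap(query_area, gallery_area, top)
print(top, gap)
```

Under these assumptions, a well-grounded embedding space would be one where nearest neighbors of a query have attribute values close to the query's, so a smaller gap indicates better grounding for that attribute.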

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20188/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20188/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/2508.20188/full.md

---
Source: https://tomesphere.com/paper/2508.20188