CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering

Aranya Saha; Tanvir Ahmed Khan; Ismam Nur Swapnil; Mohammad Ariful Haque

arXiv:2508.18430·cs.CV·August 27, 2025

TL;DR

CLARIFY is a lightweight dermatological VQA framework that combines a fast, accurate specialist classifier with a compressed generalist VLM, improving diagnostic accuracy and efficiency for clinical use.

Contribution

Introduces CLARIFY, a hierarchical Specialist-Generalist framework that enhances dermatological VQA accuracy and efficiency through domain-specific classification and knowledge-grounded reasoning.

Findings

01

Achieves 18% higher diagnostic accuracy than baseline models.

02

Reduces VRAM usage by at least 20% and latency by at least 5%.

03

Demonstrates practical viability for clinical deployment.

Abstract

Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist's predictions directly guide the Generalist's reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module,…
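
The Specialist-to-Generalist handoff described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SpecialistPrediction`, `specialist_predict`, `build_guided_prompt`, and `answer`, the prompt wording, and the callable interface for the Generalist are all hypothetical, and the classifier is reduced to an argmax over precomputed scores. The knowledge graph retrieval step is left out because the abstract text above is truncated before it is described.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SpecialistPrediction:
    """Top diagnosis from the Specialist (hypothetical container)."""
    label: str
    confidence: float


def specialist_predict(scores: Sequence[float], labels: Sequence[str]) -> SpecialistPrediction:
    """Stand-in for the Specialist: pick the highest-scoring class.

    A real Specialist would compute `scores` from the image; here they
    are assumed to be given.
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return SpecialistPrediction(label=labels[best], confidence=scores[best])


def build_guided_prompt(question: str, prediction: SpecialistPrediction) -> str:
    """Put the Specialist's prediction in front of the user's question so the
    Generalist reasons from that diagnosis (assumed prompt format)."""
    return (
        f"Specialist diagnosis: {prediction.label} "
        f"(confidence {prediction.confidence:.2f}).\n"
        f"Question: {question}\n"
        "Answer consistently with the specialist diagnosis."
    )


def answer(
    question: str,
    scores: Sequence[float],
    labels: Sequence[str],
    generalist: Callable[[str], str],
) -> str:
    """Run the two stages in order: classify, then generate a guided answer."""
    prediction = specialist_predict(scores, labels)
    prompt = build_guided_prompt(question, prediction)
    return generalist(prompt)


if __name__ == "__main__":
    # An echoing "Generalist" stands in for the compressed VLM.
    print(answer(
        "What is this lesion?",
        scores=[0.1, 0.7, 0.2],
        labels=["melanoma", "nevus", "psoriasis"],
        generalist=lambda prompt: prompt,
    ))
```

Passing the Generalist in as a plain callable keeps the sketch independent of any particular VLM library; any function that maps a prompt string to a response string can be substituted.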
