DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models

Md. Najib Hasan (1), Imran Ahmad (1), Sourav Basak Shuvo (2), Md. Mahadi Hasan Ankon (2), Sunanda Das (3), Nazmul Siddique (4), Hui Wang (5) ((1) Wichita State University, USA; (2) Khulna University of Engineering & Technology, Bangladesh; (3) University of Arkansas, USA; (4) Ulster University, UK; (5) Queen's University Belfast, UK)

arXiv:2512.13742·cs.CV·February 24, 2026

Open Access

TL;DR

This paper presents DL$^3$M, a framework combining deep learning and large language models to improve medical reasoning from endoscopic images, highlighting current limitations in model stability and reliability for clinical use.

Contribution

Introduces MobileCoAtNet, a hybrid model that classifies endoscopic images across eight stomach-related classes, and evaluates thirty-two LLMs on two expert-verified clinical-reasoning benchmarks, showing that their explanations remain unstable.

Findings

01. MobileCoAtNet achieves high classification accuracy on endoscopic images.

02. LLM explanations improve with better classification but remain unstable.

03. No LLM achieves human-level stability in reasoning.

Abstract

Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none…
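The hand-off from classifier to LLM described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the placeholder class labels, and the prompt wording are all assumptions made for illustration. Only the five reasoning categories come from the abstract.

```python
from typing import Dict, List, Tuple

# The five reasoning categories named in the abstract.
CATEGORIES: List[str] = [
    "causes", "symptoms", "treatment", "lifestyle", "follow-up care",
]


def top_prediction(probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable class label and its probability.

    Hypothetical helper: the paper's classifier interface is not specified here.
    """
    if not probs:
        raise ValueError("probs must contain at least one class")
    label = max(probs, key=lambda name: probs[name])
    return label, probs[label]


def build_prompt(label: str, confidence: float) -> str:
    """Turn a predicted label into a structured reasoning request.

    The wording is an assumption, not the prompt used in the paper.
    """
    lines = [
        f"An endoscopic image classifier predicted: {label} "
        f"(confidence {confidence:.2f}).",
        "Provide a structured explanation with one section for each of:",
    ]
    lines.extend(f"- {category}" for category in CATEGORIES)
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder labels; the eight real class names are not listed in this summary.
    probs = {"label_a": 0.07, "label_b": 0.88, "label_c": 0.05}
    label, confidence = top_prediction(probs)
    print(build_prompt(label, confidence))
```

Keeping the prompt builder separate from the classifier makes it straightforward to send the same structured request to many different LLMs and compare their answers category by category.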

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare