An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

Md. Sajeebul Islam Sk.; Md. Mehedi Hasan Shawon; Md. Golam Rabiul Alam

arXiv:2604.02502 · cs.CV · April 6, 2026

TL;DR

This paper introduces an explainable vision-language model for lumbar spinal stenosis diagnosis that improves localization, segmentation accuracy, and interpretability using a Spatial Patch Cross-Attention module and an Adaptive PID-Tversky Loss.

Contribution

The authors propose a Spatial Patch Cross-Attention module and an Adaptive PID-Tversky Loss to enhance localization and segmentation in medical imaging, advancing explainability and performance.
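For orientation, the standard Tversky index that the loss name refers to can be sketched as below. This is a minimal illustration of the textbook (non-adaptive) formulation only: the function names, the flat-list inputs, and the fixed `alpha`/`beta` weights are assumptions made here, and the PID-based adaptation the authors describe is not reproduced because this page does not specify it.

```python
def tversky_index(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Textbook Tversky index: TP / (TP + alpha * FP + beta * FN).

    pred   -- flat sequence of predicted foreground probabilities in [0, 1]
    target -- flat sequence of ground-truth labels (0 or 1)
    alpha  -- weight on false positives (hypothetical default)
    beta   -- weight on false negatives (hypothetical default)
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    # eps keeps the ratio defined when both masks are empty.
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def tversky_loss(pred, target, alpha=0.5, beta=0.5):
    """Loss form: 1 - Tversky index, so a perfect overlap gives ~0."""
    return 1.0 - tversky_index(pred, target, alpha, beta)
```

With `alpha = beta = 0.5` the index reduces to the Dice coefficient. Raising `beta` above `alpha` penalizes missed foreground pixels more heavily, which is the usual motivation for using this family of losses on class-imbalanced segmentation data.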

Findings

01. Achieved 90.69% diagnostic accuracy
02. Secured a Dice score of 0.9512 for segmentation
03. Generated clinical reports with high CIDEr scores
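As a reminder of what the reported Dice score measures, here is a minimal sketch of the standard Dice coefficient for two binary masks. The function name, the nested-list mask representation, and the empty-mask convention are assumptions made for illustration; this is not the authors' evaluation code.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal shape.

    Masks are given as nested lists (rows of 0/1 values).
    """
    pred = [int(v) for row in pred_mask for v in row]
    true = [int(v) for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixels.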

Abstract

Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge: it depends heavily on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), which leads to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while also preserving spatial accuracy, primarily because global pooling mechanisms discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations by pursuing two principal objectives. We propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies. A novel Adaptive PID-Tversky Loss function by integrating control theory…
