An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis
Md. Sajeebul Islam Sk., Md. Mehedi Hasan Shawon, Md. Golam Rabiul Alam

TL;DR
This paper introduces an explainable vision-language model for lumbar spinal stenosis diagnosis that improves localization, segmentation accuracy, and interpretability through a text-directed cross-attention module and an adaptive Tversky-based loss function.
Contribution
The authors propose a Spatial Patch Cross-Attention module and an Adaptive PID-Tversky Loss to enhance localization and segmentation in medical imaging, advancing explainability and performance.
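To make the idea of text-directed localization over image patches concrete, here is a minimal sketch of generic scaled dot-product cross-attention in which text tokens act as queries and image patch embeddings act as keys and values. It is not the authors' Spatial Patch Cross-Attention module, whose internals are not described on this page. The function name `text_to_patch_attention`, the absence of learned projections, and the averaging of attention weights into a heat map are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_to_patch_attention(text_tokens, patch_tokens, grid_hw):
    """Hypothetical helper: generic cross-attention, text queries over patches.

    text_tokens:  (T, d) array of text token embeddings (queries)
    patch_tokens: (P, d) array of image patch embeddings (keys and values)
    grid_hw:      (H, W) patch grid shape with H * W == P
    """
    d = text_tokens.shape[-1]
    # Similarity of every text token to every patch, scaled by sqrt(d).
    scores = text_tokens @ patch_tokens.T / np.sqrt(d)   # (T, P)
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    # Text features enriched with patch content.
    attended = weights @ patch_tokens                     # (T, d)
    # Assumption: average over text tokens to get one spatial heat map.
    h, w = grid_hw
    heatmap = weights.mean(axis=0).reshape(h, w)
    return attended, heatmap


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(3, 8))       # 3 text tokens, embedding size 8
    patches = rng.normal(size=(16, 8))   # 4 x 4 grid of patches
    attended, heatmap = text_to_patch_attention(text, patches, (4, 4))
    print(attended.shape, heatmap.shape)
```

Because the attention weights are kept per patch rather than pooled globally, the heat map retains the spatial layout of the patch grid, which is the property the summary contrasts with global pooling.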
Findings
Achieved 90.69% diagnostic accuracy
Secured a Dice score of 0.9512 for segmentation
Generated clinical reports with high CIDEr scores
Abstract
Lumbar Spinal Stenosis (LSS) remains a critical clinical challenge: diagnosis depends heavily on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while also preserving spatial accuracy, primarily because global pooling mechanisms discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework that addresses these limitations through two principal objectives. We propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies. A novel Adaptive PID-Tversky Loss function by integrating control theory…
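For readers unfamiliar with the base loss that the abstract builds on, the following is a minimal sketch of the standard Tversky index and the Dice score for binary segmentation masks. It does not reproduce the authors' adaptive PID component, which is not described in the text available here. The function names, the `eps` smoothing term, and the example masks are assumptions made for illustration; `alpha` weights false positives and `beta` weights false negatives, following the usual convention.

```python
import numpy as np


def tversky_index(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Standard Tversky index: TP / (TP + alpha * FP + beta * FN)."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(pred * target)            # true positives
    fp = np.sum(pred * (1.0 - target))    # false positives
    fn = np.sum((1.0 - pred) * target)    # false negatives
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def tversky_loss(pred, target, alpha=0.5, beta=0.5):
    """Loss form: 1 minus the Tversky index (lower is better)."""
    return 1.0 - tversky_index(pred, target, alpha, beta)


def dice_score(pred, target, eps=1e-7):
    """Dice score: 2 * |A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    inter = np.sum(pred * target)
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)


if __name__ == "__main__":
    # Illustrative masks only: the prediction misses two of three foreground pixels.
    pred = [1, 0, 0, 0]
    target = [1, 1, 1, 0]
    print("Dice:", dice_score(pred, target))
    print("Tversky loss (0.5, 0.5):", tversky_loss(pred, target, 0.5, 0.5))
    print("Tversky loss (0.3, 0.7):", tversky_loss(pred, target, 0.3, 0.7))
```

With `alpha = beta = 0.5` the Tversky index coincides with the Dice score. Raising `beta` above `alpha` penalizes missed foreground pixels more heavily, which is the usual reason for preferring a Tversky-style loss on class-imbalanced segmentation data.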
