# Intelligent Attention-Driven Deep Learning for Hip Disease Diagnosis: Fusing Multimodal Imaging and Clinical Text for Enhanced Precision and Early Detection

**Authors:** Jinming Zhang, He Gong, Pengling Ren, Shuyu Liu, Zhengbin Jia, Lizhen Wang, Yubo Fan

PMC · DOI: 10.3390/medicina62020250 · 2026-01-24

## TL;DR

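This paper introduces a deep learning model that fuses radiographs, CT volumes, and clinical text through attention to classify hips as normal, osteoarthritis, osteonecrosis of the femoral head (ONFH), or femoroacetabular impingement (FAI), with the aim of earlier and more precise diagnosis.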

## Contribution

An attention-based multimodal framework that fuses features from radiographs (ResNet50), CT volumes (3D-ResNet50), and clinical text (pretrained BERT) for four-class hip disease classification.

## Key findings

- The combined Clinical+X-ray+CT model achieved an AUC of 0.949 on the internal test set, outperforming all single-modality models.
- Grad-CAM visualizations confirmed the model's focus on clinically relevant regions.
- Improvements were consistent across accuracy, sensitivity, specificity, and decision curve analysis.
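
To make the per-class metrics in these findings concrete, the sketch below computes one-vs-rest sensitivity and specificity, plus overall accuracy, for a four-class problem. The labels and predictions are toy values invented for illustration, not data from the paper, and the function name `one_vs_rest_metrics` is hypothetical. The summary does not say how the authors aggregated metrics across classes, so one-vs-rest is an assumption here.

```python
CLASSES = ["normal", "osteoarthritis", "ONFH", "FAI"]


def one_vs_rest_metrics(y_true, y_pred, classes):
    """Per-class sensitivity and specificity, treating each class as 'positive' in turn."""
    results = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        results[c] = {
            # Sensitivity: fraction of true class-c hips that were labelled c.
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            # Specificity: fraction of non-c hips that were not labelled c.
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        }
    return results


# Toy labels (illustrative only, NOT results from the study).
y_true = ["normal", "normal", "osteoarthritis", "osteoarthritis",
          "ONFH", "ONFH", "FAI", "FAI"]
y_pred = ["normal", "osteoarthritis", "osteoarthritis", "osteoarthritis",
          "ONFH", "FAI", "FAI", "FAI"]

metrics = one_vs_rest_metrics(y_true, y_pred, CLASSES)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for name, m in metrics.items():
    print(f"{name}: sensitivity={m['sensitivity']:.3f}, specificity={m['specificity']:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Reporting sensitivity and specificity per class, rather than accuracy alone, shows whether a gain comes from one easy class or holds across all four diagnoses.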

## Abstract

**Background and Objectives:** Hip joint disorders exhibit diverse and overlapping radiological features, complicating early diagnosis and limiting the diagnostic value of single-modality imaging. Isolated imaging or clinical data may therefore inadequately represent disease-specific pathological characteristics.

**Materials and Methods:** This retrospective study included 605 hip joints from Center A (2018–2024), comprising normal hips, osteoarthritis, osteonecrosis of the femoral head (ONFH), and femoroacetabular impingement (FAI). An independent cohort of 24 hips from Center B (2024–2025) was used for external validation. A multimodal deep learning framework was developed to jointly analyze radiographs, CT volumes, and clinical texts. Features were extracted using ResNet50, 3D-ResNet50, and a pretrained BERT model, followed by attention-based fusion for four-class classification.

**Results:** The combined Clinical+X-ray+CT model achieved an AUC of 0.949 on the internal test set, outperforming all single-modality models. Improvements were consistently observed in accuracy, sensitivity, specificity, and decision curve analysis. Grad-CAM visualizations confirmed that the model attended to clinically relevant anatomical regions.

**Conclusions:** Attention-based multimodal feature fusion substantially improves diagnostic performance for hip joint diseases, providing an interpretable and clinically applicable framework for early detection and precise classification in orthopedic imaging.
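
The fusion step described in the abstract can be illustrated with a minimal sketch: one embedding per modality is scored against a query vector, the scores are turned into softmax attention weights, and the weighted sum feeds a four-class classifier. This is not the authors' implementation. The embeddings, the query vector, the classifier weights, the shared dimension of 8, and the names `attention_fuse` and `classify` are all hypothetical stand-ins. In a real pipeline the embeddings would come from ResNet50, 3D-ResNet50, and BERT and would presumably be projected to a common dimension first, and every parameter would be learned rather than drawn at random.

```python
import math
import random

CLASSES = ["normal", "osteoarthritis", "ONFH", "FAI"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(features, query):
    """Weight each modality embedding by scaled dot-product attention and sum them.

    features: dict of modality name -> embedding (all the same length, assumed)
    query:    hypothetical learned query vector of the same length
    """
    names = list(features)
    dim = len(query)
    scores = [dot(features[n], query) / math.sqrt(dim) for n in names]
    weights = softmax(scores)
    fused = [sum(w * features[n][i] for w, n in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))


def classify(fused, weight_rows, biases):
    """Linear layer followed by softmax over the four diagnostic classes."""
    logits = [dot(row, fused) + b for row, b in zip(weight_rows, biases)]
    return softmax(logits)


# Random placeholders standing in for encoder outputs and learned parameters.
rng = random.Random(0)
dim = 8
features = {name: [rng.gauss(0, 1) for _ in range(dim)]
            for name in ("xray", "ct", "clinical_text")}
query = [rng.gauss(0, 1) for _ in range(dim)]
weight_rows = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in CLASSES]
biases = [0.0] * len(CLASSES)

fused, weights = attention_fuse(features, query)
probs = classify(fused, weight_rows, biases)
prediction = CLASSES[probs.index(max(probs))]

print("attention weights:", {k: round(v, 3) for k, v in weights.items()})
print("predicted class:", prediction)
```

Because the attention weights sum to one, they can be read as the relative contribution of each modality to a given prediction, which is one reason this style of fusion is often paired with interpretability tools such as Grad-CAM.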

## Linked entities

- **Diseases:** osteoarthritis (MONDO:0005178)

## Full-text entities

- **Diseases:** hip OA (MESH:D015207), hip fracture (MESH:D006620), Hip Disease (MESH:D006617), Hip joint diseases (MESH:D007592), visual fatigue (MESH:D001248), osteoporosis (MESH:D010024), cardiovascular diseases (MESH:D002318), subchondral marrow lesions (MESH:D001845), synovitis (MESH:D013585), hip disorders (MESH:D006618), deformity (MESH:D009140), degeneration (MESH:D009410), bony impingement (MESH:D018213), Necrotic (MESH:D009336), groin pain (MESH:D010146), vascular impairment (MESH:D020141), orthopaedic diseases (MESH:D004194), degenerative (MESH:D019636), injury to (MESH:D014947), avascular necrosis of the femoral head (MESH:D005271), marrow edema (MESH:D004487), renal insufficiency (MESH:D051437), FAI (MESH:D057925), Osteonecrosis (MESH:D010020), hip AI (MESH:D025981), fatigue (MESH:D005221), impingement syndrome (MESH:D019534), sclerosis (MESH:D012598), density (MESH:D001851), osteonecrosis of the femoral head (MESH:D000070603), necrotic lesions (MESH:D009059), bone structural failure (MESH:D000080983), chondral damage (MESH:D020263), tears (MESH:D012167), cartilage degeneration (MESH:D002357), OA (MESH:D010003), acetabular (OMIM:142700)
- **Chemicals:** steroid (MESH:D013256)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12942138/full.md

---
Source: https://tomesphere.com/paper/PMC12942138