# Long Short-Term Memory–GPT-4 Integration for Interpretable Biomedical Signal Classification: Proof-of-Concept Study

**Authors:** Kapil Kumar Reddy Poreddy, Ajit Sahu, Sanjoy Mukherjee, Bhavan Kumar Basavaraju

PMC · DOI: 10.2196/87962 · 2026-03-20

## TL;DR

This study combines LSTM networks and GPT-4 to classify biomedical signals and generate interpretable clinical reports, aiming to improve diagnostics in resource-limited areas.

## Contribution

The paper introduces an integration of LSTM networks with GPT-4 for interpretable biomedical signal classification, intended for low-resource settings.

## Key findings

- The LSTM-GPT-4 framework achieved high classification accuracy (92.3% on MIT-BIH, 94.7% on PTB datasets).
- Generated clinical interpretations received high ratings (4.3/5 for accuracy, 4.6/5 for clarity) from board-certified physicians.
- Strong interrater agreement (κ>0.85) indicates consistent evaluation of GPT-4 outputs by medical experts.

## Abstract

Approximately 3.8 billion people lack access to essential health services, and diagnostic interpretation remains a major bottleneck in remote and resource-constrained settings. Limited access to specialists and the complexity of biomedical signal interpretation (eg, electrocardiogram [ECG] and electroencephalogram) contribute to delays in recognizing cardiovascular and neurological conditions.

The study aimed to develop and evaluate a technical framework integrating long short-term memory (LSTM) networks with GPT-4 to provide automated biomedical signal classification and human-readable interpretations, suitable as a foundation for future deployment in resource-constrained environments.

The 2-layer LSTM architecture (128→64 units) was selected based on preliminary experiments comparing configurations ranging from single-layer networks (64 or 128 units) to deeper architectures (128→64→32 units). The chosen configuration balanced model capacity against overfitting risk and computational efficiency. The framework was evaluated using public PhysioNet datasets: Massachusetts Institute of Technology–Beth Israel Hospital (MIT-BIH) Arrhythmia, Physikalisch-Technische Bundesanstalt (PTB) Diagnostic ECG, Physikalisch-Technische Bundesanstalt-extra large (PTB-XL), Chapman-Shaoxing, Medical Information Mart for Intensive Care-III (MIMIC-III) Waveforms, and Sleep-European data format (Sleep-EDF). A patient-level split protocol (70/15/15) was used to reduce leakage risk. The LSTM performed temporal feature extraction, followed by softmax-based classification over mutually exclusive classes. GPT-4 was integrated via an application programming interface, with structured prompts used to generate clinical interpretations from model outputs.
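A patient-level split can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the authors' code: the function name `patient_level_split`, the synthetic patient IDs, and the reading of 70/15/15 as train/validation/test are all assumptions. The point it shows is that patients, not individual recordings, are shuffled and partitioned, so no patient contributes records to more than one subset.

```python
import random
from collections import defaultdict


def patient_level_split(record_patient_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole patients to train/val/test (hypothetical helper).

    record_patient_ids[i] is the patient ID of record i.
    Returns a dict mapping subset name -> sorted list of record indices.
    """
    # Group record indices by the patient they came from.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(record_patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients (not records) with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Cut the shuffled patient list at the 70% and 85% marks.
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {
        name: sorted(i for pid in pids for i in by_patient[pid])
        for name, pids in groups.items()
    }


# Synthetic example: 20 patients with 3 recordings each (made-up IDs).
record_patient_ids = [f"p{p:02d}" for p in range(20) for _ in range(3)]
splits = patient_level_split(record_patient_ids)
print({name: len(indices) for name, indices in splits.items()})
```

Splitting at the record level instead would let recordings from one patient land in both training and test data, which tends to inflate test accuracy; that is the leakage risk the protocol is meant to reduce.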

For the expert evaluation, we randomly sampled 50 test cases per dataset (150 total: 30 from each class for MIT-BIH, 25 per class for PTB, and 20 per class for Children's Hospital Boston-Massachusetts Institute of Technology), ensuring balanced class representation. Three board-certified physicians (2 cardiologists for ECG datasets and 1 neurologist for the electroencephalogram dataset) independently reviewed GPT-4–generated interpretations. Reviewers were blinded to whether signals were correctly or incorrectly classified by the LSTM model. Each interpretation was rated on a 5-point Likert scale (1=clinically inappropriate and 5=highly accurate and clinically useful). Interrater reliability was assessed using Fleiss κ (0.78, substantial agreement).

On held-out test sets, classification performance was as follows: MIT-BIH 92.3% accuracy (F1=0.91, AUC=0.95), PTB Diagnostic 94.7% (F1=0.94, AUC=0.97), Physikalisch-Technische Bundesanstalt-extra large 88.9% (F1=0.88, AUC=0.93), Chapman-Shaoxing 91.2% (F1=0.90, AUC=0.94), Medical Information Mart for Intensive Care-III 89.5% (F1=0.89, AUC=0.92), and Sleep-European data format 87.3% (F1=0.86, AUC=0.91). Expert evaluation of generated interpretations (3 board-certified cardiologists) rated clinical accuracy 4.3 out of 5, clarity 4.6 out of 5, and actionability 4.2 out of 5, with strong interrater agreement (κ>0.85).
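For readers unfamiliar with Fleiss κ, the statistic compares the observed agreement among a fixed number of raters with the agreement expected by chance. The sketch below is a plain-Python implementation of the standard formula; the helper names and the toy ratings are illustrative assumptions and are not the study's data or code.

```python
def ratings_to_counts(ratings, categories=(1, 2, 3, 4, 5)):
    """Convert per-item rater scores into per-item category counts."""
    return [[scores.count(c) for c in categories] for scores in ratings]


def fleiss_kappa(counts):
    """Fleiss kappa for counts[i][j] = raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    total = n_items * n_raters
    shares = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: 4 interpretations, each scored by 3 raters on a 1-5 scale.
toy_ratings = [[5, 5, 4], [4, 4, 4], [3, 4, 4], [5, 5, 5]]
print(fleiss_kappa(ratings_to_counts(toy_ratings)))
```

Note that Fleiss κ treats the Likert levels as unordered categories, so a 4-versus-5 disagreement counts the same as a 1-versus-5 disagreement.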

This proof-of-concept demonstrates an explicit methodological integration of deep learning–based biomedical signal classification with GPT-4–based interpretation and provides a technical foundation for future prospective clinical validation, field studies, and regulatory review prior to clinical deployment in underserved settings.
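The handoff between the two components can be pictured as turning the classifier's softmax output into a structured prompt. The abstract does not give the prompt template, the class labels, or the API call, so everything in the sketch below (the function `build_prompt`, the JSON field names, the instruction wording, and the example labels and logits) is a hypothetical illustration; sending the prompt to a language model is deliberately left out.

```python
import json
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_prompt(dataset, class_names, logits, top_k=3):
    """Format classifier output as a structured prompt (hypothetical template)."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    payload = {
        "dataset": dataset,
        "predicted_class": ranked[0][0],
        "confidence": round(ranked[0][1], 3),
        "alternatives": [
            {"class": name, "probability": round(p, 3)} for name, p in ranked[1:top_k]
        ],
    }
    # The instruction text is an assumed example, not the study's prompt.
    return (
        "You are assisting a clinician. Using only the classifier output below, "
        "write a short, plain-language interpretation and state its uncertainty.\n"
        + json.dumps(payload, indent=2)
    )


# Hypothetical labels and logits for a single ECG segment.
labels = ["normal", "ventricular ectopic", "supraventricular ectopic"]
print(build_prompt("MIT-BIH Arrhythmia", labels, [3.0, 1.0, 0.2]))
```

Passing the class probabilities along with the top label, rather than the label alone, gives the language model the information it needs to express uncertainty in the generated report.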

## Full-text entities

- **Diseases:** Arrhythmia (MESH:D001145), cardiovascular (MESH:D002318), neurological conditions (MESH:D019636)
- **Species:** Homo sapiens (human, species) [taxon 9606]

---
Source: https://tomesphere.com/paper/PMC13004587