# Hybrid Vision Transformer–CNN Framework for Alzheimer’s Disease Cell Type Classification: A Comparative Study with Vision–Language Models

**Authors:** Md Easin Hasan, Md Tahmid Hasan Fuad, Omar Sharif, Amy Wagler

PMC · DOI: 10.3390/jimaging12030098 · Journal of Imaging · 2026-02-25

## TL;DR

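This paper introduces a hybrid vision transformer–CNN framework (DeiT-Small combined with EfficientNet-B7) for classifying three Alzheimer’s disease-related cell types from phase-contrast microscopy images, and compares it with conventional CNNs and prompt-based vision–language models.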

## Contribution

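The main contribution is a hybrid ViT–CNN framework that outperforms standalone CNN baselines (DenseNet, ResNet, InceptionNet, MobileNet) and prompt-based vision–language models (GPT-5, GPT-4o, Gemini 2.5-Flash) in classifying AD-related cell types under data-limited conditions.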

## Key findings

- The hybrid model achieves a test accuracy of 61.03% and a macro F1 score of 61.85 on three-class cell type classification, evaluated with stratified fivefold cross-validation.
- It outperforms conventional CNNs and vision–language models in data-limited scenarios.
- The results suggest that combining convolutional inductive biases with transformer-based global context modeling can improve generalization.
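
For readers less familiar with the two metrics in the findings, the following is a minimal sketch of how accuracy and macro F1 are typically computed for a three-class problem. Macro F1 is the unweighted mean of the per-class F1 scores, so each cell type counts equally regardless of how many images it has. The function names and the toy labels are illustrative assumptions and are not taken from the paper; the paper reports macro F1 on a 0–100 scale, while this sketch returns a fraction between 0 and 1.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels for illustration only (NOT data from the paper).
y_true = ["astrocyte", "astrocyte", "neuron", "neuron", "SH-SY5Y", "SH-SY5Y"]
y_pred = ["astrocyte", "neuron", "neuron", "neuron", "SH-SY5Y", "astrocyte"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```

Because macro F1 averages over classes rather than over samples, it can differ from accuracy whenever the per-class error rates are uneven, which is why summaries of classification results often report both.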

## Abstract

Accurate identification of Alzheimer’s disease (AD)-related cellular characteristics from microscopy images is essential for understanding neurodegenerative mechanisms at the cellular level. While most computational approaches focus on macroscopic neuroimaging modalities, cell type classification from microscopy remains relatively underexplored. In this study, we propose a hybrid vision transformer–convolutional neural network (ViT–CNN) framework that integrates DeiT-Small and EfficientNet-B7 to classify three AD-related cell types—astrocytes, cortical neurons, and SH-SY5Y neuroblastoma cells—from phase-contrast microscopy images. We perform a comparative evaluation against conventional CNN architectures (DenseNet, ResNet, InceptionNet, and MobileNet) and prompt-based multimodal vision–language models (GPT-5, GPT-4o, and Gemini 2.5-Flash) using zero-shot, few-shot, and chain-of-thought prompting. Experiments conducted with stratified fivefold cross-validation show that the proposed hybrid model achieves a test accuracy of 61.03% and a macro F1 score of 61.85, outperforming standalone CNN baselines and prompt-only LLM approaches under data-limited conditions. These results suggest that combining convolutional inductive biases with transformer-based global context modeling can improve generalization for cellular microscopy classification. While constrained by dataset size and scope, this work serves as a proof of concept and highlights promising directions for future research in domain-specific pretraining, multimodal data integration, and explainable AI for AD-related cellular analysis.
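
The stratified fivefold cross-validation mentioned in the abstract can be illustrated with a small, generic sketch. Stratification means that each fold keeps roughly the same class proportions as the full dataset, which matters when data are limited. This summary does not describe the paper's actual splitting procedure, so the function name `stratified_kfold_indices`, the fixed seed, and the toy label list are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions roughly preserved.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's samples round-robin so every fold gets a share.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)

    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Toy dataset: 10 samples per class (sizes are made up, not from the paper).
labels = ["astrocyte"] * 10 + ["cortical_neuron"] * 10 + ["SH-SY5Y"] * 10
splits = list(stratified_kfold_indices(labels, k=5))

for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

In each round, a model would be trained on `train_idx` and scored on `test_idx`, and the five scores would then be averaged. With unequal class sizes, the round-robin deal gives the earliest folds any leftover samples, which is acceptable for a sketch but something a production splitter would balance more carefully.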

## Linked entities

- **Diseases:** Alzheimer’s disease (MONDO:0004975)

## Full-text entities

- **Genes:** APP (amyloid beta precursor protein) [NCBI Gene 351] {aka AAA, ABETA, ABPP, AD1, APPI, CTFgamma}
- **Diseases:** neurodegenerative disease (MESH:D019636), atrophy (MESH:D001284), injury to (MESH:D014947), neuronal death (MESH:D009410), AD (MESH:D000544), neuroblastoma (MESH:D009447)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** SH-SY5Y — Homo sapiens (Human), Neuroblastoma, Cancer cell line (CVCL_0019)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13028275/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13028275/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/PMC13028275/full.md

---
Source: https://tomesphere.com/paper/PMC13028275