# A Novel CNN–ViT Model with Cascade Upsampling for Efficient Crack Segmentation

**Authors:** Ahmed Tibermacine, Imad Eddine Tibermacine, Zineddine S. Kahhoul, Ilyes Naidji, Abdelaziz Rabehi, Mustapha Habib

PMC · DOI: 10.3390/s26051667 · Sensors (Basel, Switzerland) · 2026-03-06

## TL;DR

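This paper introduces a hybrid CNN–ViT model for crack segmentation in civil infrastructure images. It reports consistent improvements over convolutional, Transformer-based, and hybrid baselines on four public benchmarks, with latency and memory use low enough for embedded and edge inspection platforms.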

## Contribution

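A hybrid segmentation architecture that pairs a convolutional encoder with a lightweight Transformer bottleneck and a progressive (cascade upsampling) decoder using multi-level skip-feature fusion, trained with a composite Dice–Binary Cross-Entropy loss.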

## Key findings

- The model outperforms existing convolutional, Transformer-based, and hybrid baselines on four public benchmarks.
- Runtime profiling shows favorable latency and memory usage, supporting real-time or near real-time deployment on embedded and edge inspection platforms.
- Ablation studies confirm the effectiveness of each architectural component.

## Abstract

Accurate crack segmentation in civil infrastructure imagery remains challenging because of the prevalence of thin, low-contrast, and spatially discontinuous defects that often appear amid textured surfaces, shadows, and acquisition noise. Although Transformer-based models improve global context modeling, many existing solutions incur substantial computational and memory overhead, which limits their use in practical, resource-constrained inspection settings. In this work, we introduce an efficient hybrid segmentation architecture that combines a convolutional encoder for high-fidelity local representation with a lightweight Transformer bottleneck for global dependency modeling, followed by a progressive decoder that restores spatial resolution through multi-level skip-feature fusion. To better accommodate severe foreground sparsity and preserve fine crack structures, the framework is trained with a composite Dice–Binary Cross-Entropy objective and employs a tokenization strategy designed to preserve fine spatial details while enabling efficient global context modeling. We validate the proposed approach on four public benchmarks, demonstrating consistent improvements over representative convolutional, Transformer-based, and hybrid baselines, while ablation studies confirm the contribution of each design component. Finally, runtime profiling shows favorable latency and memory characteristics, supporting real-time or near real-time deployment on embedded and edge inspection platforms.
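The composite Dice–Binary Cross-Entropy objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dice_bce_loss`, the equal weighting (`bce_weight=0.5`), the `smooth` term, and the flat-list input format are all assumptions made for illustration, since the summary does not give the exact formulation. The sketch shows why combining the two terms helps with sparse foregrounds: BCE scores every pixel independently, while the Dice term depends on the overlap with the (small) crack region.

```python
import math


def dice_bce_loss(probs, targets, bce_weight=0.5, smooth=1.0, eps=1e-7):
    """Illustrative composite loss: bce_weight * BCE + (1 - bce_weight) * (1 - Dice).

    probs   -- predicted crack probabilities in [0, 1], flattened over pixels
    targets -- ground-truth labels (0 = background, 1 = crack), same length
    The weighting and smoothing values are assumptions, not taken from the paper.
    """
    if len(probs) != len(targets) or len(probs) == 0:
        raise ValueError("probs and targets must be non-empty and equal in length")

    # Binary cross-entropy, averaged over pixels; clamp to avoid log(0).
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= len(probs)

    # Soft Dice coefficient: overlap between prediction and ground truth.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)

    return bce_weight * bce + (1.0 - bce_weight) * (1.0 - dice)


# Toy example: one crack pixel among four.
targets = [1, 0, 0, 0]
perfect = dice_bce_loss([1.0, 0.0, 0.0, 0.0], targets)
missed = dice_bce_loss([0.0, 0.0, 0.0, 0.0], targets)  # predicts all background
```

In the toy example, predicting all background gets three of four pixels right, yet `missed` is still much larger than `perfect`, because both the clamped BCE term and the Dice term penalize the missed crack pixel. In a real training setup the same computation would be done on tensors over a batch rather than on Python lists.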

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12986669/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12986669/full.md

## References

40 references — full list in the complete paper: https://tomesphere.com/paper/PMC12986669/full.md

---
Source: https://tomesphere.com/paper/PMC12986669