# A hybrid CNN-ViT based framework for automatic traffic actions detection in smart cities

**Authors:** Mucahit Karaduman, Neunggyu Han, Gulsah Karaduman, Muhammed Yildirim, Yongwon Cho, Yunyoung Nam

PMC · DOI: 10.1371/journal.pone.0339796 · 2026-01-16

## TL;DR

This paper proposes a framework that combines features from a CNN and a ViT to classify traffic accidents and traffic situations into eight categories, reaching 96.88% accuracy.

## Contribution

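A hybrid CNN-ViT model for traffic action detection is proposed. It fuses the features of the best-performing CNN with those of the best-performing ViT, chosen from five candidates of each type, and achieves 96.88% accuracy.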

## Key findings

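- Combining features from the best-performing CNN and ViT models improves traffic action detection over either model alone.
- The proposed model reaches 96.88% accuracy across eight categories, outperforming all ten individual models (five CNN-based, five ViT-based) evaluated in the study.
- The authors position the framework as a way to support timely emergency response and sustainable urban life in smart cities.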

## Abstract

Detecting traffic accidents and hazardous situations automatically, promptly, and accurately is crucial: it protects individual safety and contributes to economic efficiency and sustainable urban life. Millions of people die in traffic accidents every year, which places an additional burden on health systems and leads to many other undesirable consequences. Early detection of events such as traffic congestion, accidents, and road closures accelerates emergency response, helps regulate traffic flow, and helps prevent secondary accidents. Artificial intelligence-supported automatic systems therefore stand out as a key component of smart cities. This study aims to detect traffic accidents and traffic situations automatically. For this purpose, features were extracted with five Convolutional Neural Network (CNN) based and five Vision Transformer (ViT) based models, and the features from each model were evaluated with different classifiers. The CNN model and the ViT model that yielded the best results served as the base for the proposed model: their features were combined so that different representations of the same image are brought together. The combined features were then classified into eight categories using various classifiers. The proposed model produced better results than the ten individual models evaluated in the preliminary experiments, reaching an accuracy of 96.88%. This result is promising for future studies and is relevant to sustainability and quality of life in smart cities.
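The fusion step described in the abstract (features from one CNN and one ViT, combined per image and passed to a classifier) can be illustrated with a minimal, dependency-free sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the feature dimensions, the L2 normalization before concatenation, the synthetic random features standing in for real backbone outputs, and the nearest-centroid classifier standing in for the unnamed "various classifiers". Only the overall shape (two feature vectors per image, concatenated, eight classes) follows the abstract.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (assumed step, so neither branch dominates)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(cnn_feat, vit_feat):
    """Feature-level fusion: concatenate the two normalized feature vectors."""
    return l2_normalize(cnn_feat) + l2_normalize(vit_feat)


class NearestCentroid:
    """Stand-in classifier (assumption): predicts the class whose mean vector is closest."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for vec, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, v in enumerate(vec):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda c: math.dist(vec, self.centroids_[c]))
            for vec in X
        ]


def make_synthetic(n_classes=8, per_class=20, cnn_dim=16, vit_dim=12, seed=0):
    """Random vectors that stand in for CNN and ViT features of the same image."""
    rng = random.Random(seed)
    cnn_centers = [[rng.gauss(0, 1) for _ in range(cnn_dim)] for _ in range(n_classes)]
    vit_centers = [[rng.gauss(0, 1) for _ in range(vit_dim)] for _ in range(n_classes)]
    X, y = [], []
    for label in range(n_classes):
        for _ in range(per_class):
            cnn = [c + rng.gauss(0, 0.3) for c in cnn_centers[label]]
            vit = [c + rng.gauss(0, 0.3) for c in vit_centers[label]]
            X.append(fuse(cnn, vit))
            y.append(label)
    return X, y


X, y = make_synthetic()
# Hold out every fourth sample for testing; train on the rest.
train_idx = [i for i in range(len(X)) if i % 4 != 0]
test_idx = [i for i in range(len(X)) if i % 4 == 0]
clf = NearestCentroid().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
pred = clf.predict([X[i] for i in test_idx])
accuracy = sum(p == y[i] for p, i in zip(pred, test_idx)) / len(test_idx)
print(f"fused dim = {len(X[0])}, accuracy on synthetic data = {accuracy:.2f}")
```

In a real pipeline, `make_synthetic` would be replaced by feature extraction from pretrained backbones, and the printed accuracy says nothing about the paper's reported 96.88%, since the data here is synthetic.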

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12810786/full.md

---
Source: https://tomesphere.com/paper/PMC12810786