# Vision-Based People Counting and Tracking for Urban Environments

**Authors:** Daniyar Nurseitov, Kairat Bostanbekov, Nazgul Toiganbayeva, Aidana Zhalgas, Didar Yedilkhan, Beibut Amirgaliyev

PMC · DOI: 10.3390/jimaging12010027 · 2026-01-05

## TL;DR

This paper introduces a deep-learning computer vision system, built on YOLOv8 detection and a modified DeepSORT tracker, that detects, tracks, and counts passengers in urban transport settings.

## Contribution

The paper presents a modified DeepSORT tracking pipeline and a unified architecture for detection, tracking, and event logging in dense urban environments.
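To make the tracking step concrete, the sketch below shows frame-to-frame association in its simplest form: existing track boxes are matched to new detections by bounding-box overlap (IoU). This is a deliberately simplified stand-in, not the paper's modified DeepSORT. DeepSORT itself combines a Kalman-filter motion model with appearance embeddings and solves the assignment with the Hungarian algorithm, whereas this example uses greedy IoU matching. The names `iou` and `associate` and the `min_iou` threshold are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box],
    detections: List[Box],
    min_iou: float = 0.3,  # assumed threshold, not taken from the paper
) -> Tuple[Dict[int, int], List[int], List[int]]:
    """Greedily match tracks to detections in order of descending IoU.

    Returns (matches, unmatched_track_ids, unmatched_detection_indices),
    where matches maps track id -> detection index.
    """
    pairs = sorted(
        (
            (iou(track_box, det_box), track_id, det_idx)
            for track_id, track_box in tracks.items()
            for det_idx, det_box in enumerate(detections)
        ),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_detections = set()
    for score, track_id, det_idx in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if track_id in matches or det_idx in used_detections:
            continue
        matches[track_id] = det_idx
        used_detections.add(det_idx)
    unmatched_tracks = [t for t in tracks if t not in matches]
    unmatched_dets = [i for i in range(len(detections)) if i not in used_detections]
    return matches, unmatched_tracks, unmatched_dets
```

In a full tracker, unmatched detections would typically start new tracks and tracks left unmatched for several frames would be dropped. Pure IoU matching tends to fail under strong occlusion, which is one reason DeepSORT-style trackers add appearance cues.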

## Key findings

- The proposed system achieved 92% detection accuracy and 85% counting accuracy on a new proprietary dataset of 4047 images with 8918 labeled objects.
- Compared with Mask R-CNN and DETR, YOLOv8 offered the best balance of speed, accuracy, and computational efficiency.
- The system generates annotated video streams and event logs, offering a scalable alternative to traditional passenger counting methods.

## Abstract

Population growth and the expansion of urban areas increase the need for intelligent passenger traffic monitoring systems. Accurate estimation of passenger numbers is an important prerequisite for improving the efficiency, safety, and quality of transport services. This paper proposes an approach to the automatic detection and counting of people using computer vision and deep learning methods. While YOLOv8 and DeepSORT have been widely explored individually, our contribution lies in a task-specific modification of the DeepSORT tracking pipeline, optimized for dense passenger environments, strong occlusions, and dynamic lighting, as well as in a unified architecture that integrates detection, tracking, and automatic event-log generation. On our new proprietary dataset of 4047 images and 8918 labeled objects, the system achieved 92% detection accuracy and 85% counting accuracy, which confirms the effectiveness of the solution. Compared to Mask R-CNN and DETR, the YOLOv8 model demonstrates an optimal balance between speed, accuracy, and computational efficiency. The results confirm that computer vision can become an efficient and scalable replacement for traditional sensor-based passenger counting systems. The developed architecture (YOLO + Tracking) combines recognition, tracking, and counting of people into a single system that automatically generates annotated video streams and event logs. Future work will expand the dataset, add support for multi-camera integration, and adapt the model for embedded devices to improve the accuracy and energy efficiency of the solution in real-world conditions.
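The abstract does not spell out how tracked people are turned into counts and event logs. One common approach, shown here only as a minimal sketch, is to count a track when its box center crosses a virtual line and to append a record to a log at that moment. Everything in the example is an assumption for illustration: the `LineCounter` class, the horizontal counting line, the "enter"/"exit" direction convention, and the JSON Lines log format are not taken from the paper.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class LineCounter:
    """Counts tracks crossing a horizontal line at y = line_y.

    Assumed convention: image y grows downward, so a center moving from
    above the line to on/below it is an "enter", the reverse an "exit".
    """

    line_y: float
    entered: int = 0
    exited: int = 0
    events: List[dict] = field(default_factory=list)
    _last_y: Dict[int, float] = field(default_factory=dict)

    def update(self, frame_idx: int, tracks: Dict[int, Box]) -> None:
        """Process one frame of tracker output (track id -> box)."""
        for track_id, (_x1, y1, _x2, y2) in tracks.items():
            center_y = (y1 + y2) / 2.0
            previous_y = self._last_y.get(track_id)
            self._last_y[track_id] = center_y
            if previous_y is None:
                continue  # first sighting: no crossing can be decided yet
            if previous_y < self.line_y <= center_y:
                self.entered += 1
                self._log(frame_idx, track_id, "enter")
            elif previous_y >= self.line_y > center_y:
                self.exited += 1
                self._log(frame_idx, track_id, "exit")

    def _log(self, frame_idx: int, track_id: int, kind: str) -> None:
        self.events.append({"frame": frame_idx, "track_id": track_id, "event": kind})

    def to_jsonl(self) -> str:
        """Serialize the event log as JSON Lines (one event per line)."""
        return "\n".join(json.dumps(event) for event in self.events)


# Usage: feed per-frame tracker output, then read counts and the log.
counter = LineCounter(line_y=100.0)
counter.update(0, {7: (10, 80, 30, 100)})    # track 7 center at y=90
counter.update(1, {7: (10, 95, 30, 115)})    # center at y=105: crosses downward
counter.update(2, {8: (50, 100, 70, 120)})   # track 8 first seen below the line
counter.update(3, {8: (50, 80, 70, 100)})    # center at y=90: crosses upward
print(counter.entered, counter.exited)
print(counter.to_jsonl())
```

Counting on track identities rather than per-frame detections is what makes tracker quality matter: an identity switch near the line can cause a missed or duplicated count, which is consistent with counting accuracy being lower than detection accuracy.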

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12843365/full.md

---
Source: https://tomesphere.com/paper/PMC12843365