A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors

Gia-Bao Doan; Nam-Khoa Huynh; Minh-Nhat-Huy Ho; Khanh-Thanh-Khoa Nguyen; Thanh-Hai Le

arXiv:2603.21048·cs.CV·March 24, 2026

TL;DR

This paper introduces a two-stage transformer-based framework for accurately localizing distracted driver behaviors in in-cabin videos, balancing high performance with computational efficiency for practical driver monitoring applications.

Contribution

The work presents a novel two-stage transformer framework combining VideoMAE features with an augmented self-mask attention detector and multi-scale temporal pooling, tailored for driver behavior localization.
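As a rough illustration of what multi-scale temporal pooling in the SPPF style can look like, the sketch below applies the same stride-1 max pool three times in sequence along the time axis and concatenates the input with each pooled result, so every time step sees progressively wider temporal context. This is a minimal single-channel sketch under stated assumptions, not the authors' implementation: the function names `max_pool_1d` and `sppf_temporal`, the kernel size, and the omission of the convolution layers that usually surround an SPPF block are all assumptions made for illustration.

```python
def max_pool_1d(seq, kernel=5):
    """Stride-1 max pooling over time; windows are clipped at the borders,
    so the output has the same length as the input."""
    half = kernel // 2
    n = len(seq)
    return [max(seq[max(0, t - half):min(n, t + half + 1)]) for t in range(n)]


def sppf_temporal(seq, kernel=5):
    """SPPF-style pooling (hypothetical helper): three chained max pools.

    Chaining pools with the same kernel widens the effective window at each
    step (k, 2k-1, 3k-2) while reusing intermediate results.
    Returns one 4-element feature vector per time step:
    [input, pool x1, pool x2, pool x3].
    """
    p1 = max_pool_1d(seq, kernel)
    p2 = max_pool_1d(p1, kernel)
    p3 = max_pool_1d(p2, kernel)
    return [list(step) for step in zip(seq, p1, p2, p3)]


# Toy per-frame activation with a single spike at t = 1.
features = [0, 1, 0, 0, 0, 0, 0, 0, 0]
pooled = sppf_temporal(features, kernel=3)
print(pooled[4])  # only the widest branch still "sees" the spike at t = 1
```

In a real detector the input would be a multi-channel feature sequence (for example, clip-level VideoMAE embeddings) and the concatenation would run along the channel dimension; the single-channel list here only keeps the example dependency-free.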

Findings

01. ViT-Giant + SPPF achieves 92.67% mAP.

02. A lightweight ViT-based model attains 82.55% accuracy at lower computational cost.

03. The SPPF module improves localization performance across configurations.
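For readers unfamiliar with how localization quality is usually scored, mAP in temporal action localization is commonly built on the temporal intersection-over-union (tIoU) between a predicted segment and a ground-truth segment: a prediction counts as correct when its tIoU exceeds a chosen threshold. The sketch below shows only that overlap measure. The function name `temporal_iou` and the example segments are hypothetical, and the thresholds and matching protocol used in the paper are not stated in this summary.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# A predicted segment that covers half of the ground-truth segment.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```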

Abstract

The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and…

Taxonomy

Topics: Human Pose and Action Recognition · Autonomous Vehicle Technology and Safety · Video Surveillance and Tracking Methods