A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments
Anthony Dontoh, Stephanie Ivey, Armstrong Aboah

TL;DR
This study evaluates how adding road-facing views to driver-facing footage affects distraction detection accuracy in naturalistic driving, and finds that whether the contextual input helps or hurts depends critically on model architecture.
Contribution
The paper provides a systematic comparison of single-view and dual-view distraction detection models on real-world driving data, showing that the effect of dual-view input depends on the architecture.
Findings
SlowOnly-R50, a single-pathway model, improved by 9.8% with dual-view inputs.
SlowFast-R50 experienced a 7.2% accuracy drop with dual-view inputs.
Architecture design determines whether contextual inputs enhance or hinder detection.
Abstract
Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view…
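The "stacked dual-view" configuration mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two views are combined, so this example assumes a vertical stack (driver-facing frame on top, road-facing frame below) of synchronized clips with the same frame count and width. The names `stack_dual_view` and `make_clip` are hypothetical, and frames are plain nested lists of grayscale values for brevity; a real pipeline would use an array library and color frames.

```python
from typing import List

Frame = List[List[float]]  # H rows x W pixels (grayscale, for brevity)
Clip = List[Frame]         # T frames


def stack_dual_view(driver_clip: Clip, road_clip: Clip) -> Clip:
    """Stack each driver-facing frame on top of its synchronized road-facing frame.

    Assumption: vertical (height-wise) stacking; the source does not specify
    the stacking axis.
    """
    if len(driver_clip) != len(road_clip):
        raise ValueError("clips must have the same number of frames")
    stacked: Clip = []
    for driver_frame, road_frame in zip(driver_clip, road_clip):
        if len(driver_frame[0]) != len(road_frame[0]):
            raise ValueError("frames must share the same width")
        # Driver rows first, road rows below; rows are copied so the
        # inputs are not aliased by the output.
        stacked.append([row[:] for row in driver_frame]
                       + [row[:] for row in road_frame])
    return stacked


def make_clip(frames: int, height: int, width: int, value: float) -> Clip:
    """Build a constant-valued dummy clip (hypothetical helper for the demo)."""
    return [[[value] * width for _ in range(height)] for _ in range(frames)]


driver = make_clip(frames=4, height=2, width=3, value=0.0)
road = make_clip(frames=4, height=2, width=3, value=1.0)
dual = stack_dual_view(driver, road)
```

Under this assumption the dual-view clip keeps the frame count and width of each input and doubles the height, so a single-stream video model can take it as an ordinary clip without architectural changes.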
