DAOS: A Multimodal In-cabin Behavior Monitoring with Driver Action-Object Synergy Dataset

Yiming Li; Chen Cai; Tianyi Liu; Dan Lin; Wenqian Wang; Wenfei Liang; Bingbing Li; Kim-Hui Yap

arXiv:2601.11990·cs.CV·January 21, 2026

Open Access

TL;DR

This paper introduces the DAOS dataset for driver action recognition, emphasizing human-object interactions, and proposes the AOR-Net model that leverages multi-modal data and reasoning to improve accuracy.

Contribution

The paper presents the first comprehensive driver action dataset with object annotations and a novel AOR-Net model for improved action recognition through multi-level reasoning.

Findings

01. AOR-Net outperforms existing methods on multiple datasets.

02. The DAOS dataset contains 9,787 video clips annotated with 36 fine-grained driver actions and 15 object classes.

03. The model effectively captures human-object relations in driver actions.

Abstract

In driver activity monitoring, movements are mostly limited to the upper body, which makes many actions look similar. To tell these actions apart, humans often rely on the objects the driver is using, such as holding a phone compared with gripping the steering wheel. However, most existing driver-monitoring datasets lack accurate object-location annotations or do not link objects to their associated actions, leaving a critical gap for reliable action recognition. To address this, we introduce the Driver Action with Object Synergy (DAOS) dataset, comprising 9,787 video clips annotated with 36 fine-grained driver actions and 15 object classes, totaling more than 2.5 million corresponding object instances. DAOS offers multi-modal, multi-view data (RGB, IR, and depth) from front, face, left, and right perspectives. Although DAOS captures a wide range of cabin objects, only a few are directly…

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Social Robot Interaction and HRI