# Robust Occupant Behavior Recognition via Multimodal Sequence Modeling: A Comparative Study for In-Vehicle Monitoring Systems

**Authors:** Jisu Kim, Byoung-Keon D. Park

PMC · DOI: 10.3390/s25206323 · Sensors (Basel, Switzerland) · 2025-10-13

## TL;DR

This paper compares a static MLP, a recurrent LSTM, and an attention-based Transformer encoder for recognizing vehicle occupant behaviors from 2D pose, 2D gaze, and facial movement sequences, and finds that the Transformer performs best (Macro F1 of 0.9570).

## Contribution

The study provides a comprehensive comparison of static, recurrent, and attention-based models for multimodal occupant behavior recognition in vehicles, evaluated on the Occupant Behavior Classification (OBC) dataset.

## Key findings

- Temporal models (LSTM and Transformer) significantly outperform the static MLP baseline in occupant behavior recognition.
- The Transformer achieves a state-of-the-art Macro F1 score of 0.9570 with a 50-frame span and a step size of 10.
- Transformers offer a strong balance between performance and computational efficiency.
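
For readers unfamiliar with the metric, Macro F1 is the unweighted mean of the per-class F1 scores, so rare behavior classes count as much as frequent ones. The sketch below is a minimal illustration of that definition, not the authors' evaluation code; the function name `macro_f1` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # class p was predicted but wrong
            fn[t] += 1          # class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for three classes (illustrative only, not OBC data)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred)
```

With 79 classes, as in the OBC dataset, a high Macro F1 requires good performance across nearly all classes rather than only the common ones.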

## Abstract

Understanding occupant behavior is critical for enhancing safety and situational awareness in intelligent transportation systems. This study investigates multimodal occupant behavior recognition using sequential inputs extracted from 2D pose, 2D gaze, and facial movements. We conduct a comprehensive comparative study of three distinct architectural paradigms: a static Multi-Layer Perceptron (MLP), a recurrent Long Short-Term Memory (LSTM) network, and an attention-based Transformer encoder. All experiments are performed on the large-scale Occupant Behavior Classification (OBC) dataset, which contains approximately 2.1 million frames across 79 behavior classes collected in a controlled, simulated environment. Our results demonstrate that temporal models significantly outperform the static baseline. The Transformer model, in particular, emerges as the superior architecture, achieving a state-of-the-art Macro F1 score of 0.9570 with a configuration of a 50-frame span and a step size of 10. Furthermore, our analysis reveals that the Transformer provides an excellent balance between high performance and computational efficiency. These findings demonstrate the superiority of attention-based temporal modeling with multimodal fusion and provide a practical framework for developing robust and efficient in-vehicle occupant monitoring systems. Implementation code and supplementary resources are available (see Data Availability Statement).
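
The "50-frame span and a step size of 10" configuration can be pictured as a sliding window over per-frame features. The sketch below is one plausible reading, not the authors' pipeline: it assumes the step size is the stride between window starts, and it assumes early fusion by concatenating the pose, gaze, and face features of each frame. The names `fuse_frame` and `sliding_windows` and the toy feature sizes are hypothetical.

```python
def fuse_frame(pose, gaze, face):
    """Concatenate one frame's modality features (assumed early fusion)."""
    return list(pose) + list(gaze) + list(face)


def sliding_windows(frames, span=50, step=10):
    """Return windows of `span` consecutive frames, advancing by `step` frames."""
    if span <= 0 or step <= 0:
        raise ValueError("span and step must be positive")
    # The last valid start index is len(frames) - span; shorter inputs yield no windows.
    return [frames[s:s + span] for s in range(0, len(frames) - span + 1, step)]


# Toy sequence of 100 frames with made-up feature sizes (2 pose, 1 gaze, 3 face)
frames = [fuse_frame([i, i], [i], [i, i, i]) for i in range(100)]
windows = sliding_windows(frames, span=50, step=10)
```

Each window would then be passed to a sequence model (LSTM or Transformer encoder) as one input sample. Under this reading, consecutive windows overlap by 40 frames.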

## Full-text entities

- **Diseases:** fatigue (MESH:D005221), injury to (MESH:D014947)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12568052/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12568052/full.md

## References

15 references — full list in the complete paper: https://tomesphere.com/paper/PMC12568052/full.md

---
Source: https://tomesphere.com/paper/PMC12568052