T-MASK: Temporal Masking for Probing Foundation Models across Camera Views in Driver Monitoring
Thinesh Thiyakesan Ponbagavathi, Kunyu Peng, Alina Roitberg

TL;DR
This paper introduces T-MASK, a temporal masking probing method that enhances foundation model robustness for driver monitoring across camera views, especially in low-data and cross-view scenarios.
Contribution
The paper proposes T-MASK, a novel temporal token masking approach that improves cross-view recognition accuracy without additional parameters, advancing foundation model adaptation in driver monitoring.
Findings
T-MASK improves cross-view top-1 accuracy by +1.23% over baseline probing methods.
T-MASK improves cross-view top-1 accuracy by +8.0% over PEFT methods.
T-MASK significantly boosts recognition of secondary activities in driver monitoring.
Abstract
Changes of camera perspective are a common obstacle in driver monitoring. While deep learning and pretrained foundation models show strong potential for improved generalization via lightweight adaptation of the final layers ('probing'), their robustness to unseen viewpoints remains underexplored. We study this challenge by adapting image foundation models to driver monitoring using a single training view and evaluating them directly on unseen perspectives without further adaptation. We benchmark simple linear probes and advanced probing strategies, and compare two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. Building on these insights, we introduce T-MASK, a new image-to-video probing method that leverages temporal token masking and emphasizes more dynamic video regions. Benchmarked on the public Drive&Act dataset, T-MASK…
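To make the idea of temporal token masking concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the motion score (mean frame-to-frame token difference), the `keep_ratio` parameter, the function names, and the mean pooling are all assumptions chosen for illustration. It only shows one plausible way to keep the more dynamic patch positions of a video before a parameter-free pooling step that could feed a linear probe.

```python
import numpy as np

def temporal_token_mask(tokens, keep_ratio=0.5):
    """Select the most dynamic patch positions (hypothetical scoring rule).

    tokens: array of shape (T, N, D) holding per-frame patch tokens,
            e.g. from a frozen image foundation model.
    Returns a boolean mask of shape (N,) marking the kept positions.
    """
    # Assumed motion score: average L2 norm of frame-to-frame differences.
    diffs = np.diff(tokens, axis=0)                        # (T-1, N, D)
    motion = np.linalg.norm(diffs, axis=-1).mean(axis=0)   # (N,)
    k = max(1, int(round(keep_ratio * tokens.shape[1])))
    keep_idx = np.argsort(motion)[-k:]                     # k largest scores
    mask = np.zeros(tokens.shape[1], dtype=bool)
    mask[keep_idx] = True
    return mask

def masked_video_feature(tokens, keep_ratio=0.5):
    """Mean-pool only the kept tokens over time and space; adds no parameters."""
    mask = temporal_token_mask(tokens, keep_ratio)
    return tokens[:, mask, :].mean(axis=(0, 1))            # (D,)

# Toy data: 8 frames, 16 patch positions, 32-dim tokens (all values synthetic).
rng = np.random.default_rng(0)
T, N, D = 8, 16, 32
tokens = np.repeat(rng.normal(size=(1, N, D)), T, axis=0)  # static scene
moving = [2, 5, 7, 11]                                      # positions given motion
tokens[:, moving, :] += rng.normal(size=(T, len(moving), D))

mask = temporal_token_mask(tokens, keep_ratio=0.25)
feature = masked_video_feature(tokens, keep_ratio=0.25)
```

In this toy setup the static positions have a motion score of exactly zero, so the mask keeps only the four positions that were perturbed over time. The pooled `feature` vector could then be passed to a linear classifier trained on a single camera view.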
