TL;DR
CollideNet is a hierarchical transformer for time-to-collision (TTC) forecasting in video. It captures multi-scale spatial and temporal patterns and, in its temporal stream, disentangles non-stationarity, trend, and seasonality.
Contribution
Introduces a novel spatiotemporal hierarchical transformer that encodes multi-scale features and disentangles the non-stationarity, trend, and seasonality components to improve TTC forecasting accuracy.
Findings
Achieves state-of-the-art performance on three public datasets.
Demonstrates strong generalization across different datasets.
Visualizes disentanglement of trend and seasonality in video data.
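To make the trend/seasonality split concrete, here is a minimal sketch using a centered moving average, a decomposition commonly used in time-series transformers. This summary does not say how CollideNet performs the disentanglement, so the function name `decompose`, the `kernel` parameter, and the moving-average choice itself are illustrative assumptions, not the authors' method.

```python
import numpy as np


def decompose(series: np.ndarray, kernel: int = 5):
    """Split a (T, D) feature sequence into trend and seasonal parts.

    Hypothetical helper for illustration only:
      trend    = centered moving average over time
      seasonal = series - trend
    """
    if kernel % 2 == 0:
        raise ValueError("kernel must be odd so the window is centered")
    pad = kernel // 2
    # Repeat the first and last time steps so the output keeps length T.
    padded = np.concatenate(
        [
            np.repeat(series[:1], pad, axis=0),
            series,
            np.repeat(series[-1:], pad, axis=0),
        ],
        axis=0,
    )
    # Moving average via a cumulative sum along the time axis.
    zeros = np.zeros((1, series.shape[1]))
    csum = np.cumsum(np.concatenate([zeros, padded], axis=0), axis=0)
    trend = (csum[kernel:] - csum[:-kernel]) / kernel
    seasonal = series - trend
    return trend, seasonal


# Toy per-frame feature: a slow drift plus a period-4 oscillation.
t = np.arange(32, dtype=float)
features = (0.1 * t + np.sin(2 * np.pi * t / 4))[:, None]
trend, seasonal = decompose(features, kernel=5)
```

By construction the two parts sum back to the input, so plotting `trend` and `seasonal` side by side gives the kind of visualization this finding refers to.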
Abstract
Time-to-Collision (TTC) forecasting is a critical task in collision prevention. It requires precise temporal prediction and an understanding of both local and global patterns in a video, spatially and temporally. To address the multi-scale nature of video, we introduce CollideNet, a novel spatiotemporal hierarchical transformer-based architecture designed specifically for TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method sets a new state of the art on three commonly used public datasets, outperforming prior works by a considerable margin. We conduct cross-dataset evaluations…
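The idea of aggregating each frame at multiple resolutions can be sketched as a simple token pyramid: average-pool the frame's feature map by several factors and concatenate the resulting tokens. This is only an illustration under stated assumptions. The names `avg_pool` and `multi_scale_tokens` and the pooling factors are hypothetical, and CollideNet's actual spatial stream is a hierarchical transformer whose details are not given in this abstract.

```python
import numpy as np


def avg_pool(frame: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool an (H, W, C) feature map by an integer factor."""
    h, w, c = frame.shape
    if h % factor or w % factor:
        raise ValueError("H and W must be divisible by the pooling factor")
    blocks = frame.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def multi_scale_tokens(frame: np.ndarray, factors=(1, 2, 4)) -> np.ndarray:
    """Flatten each pooled map into tokens and concatenate across scales."""
    tokens = []
    for f in factors:
        pooled = avg_pool(frame, f)                       # (H/f, W/f, C)
        tokens.append(pooled.reshape(-1, frame.shape[2]))  # (H*W/f^2, C)
    return np.concatenate(tokens, axis=0)


# Toy 8x8 frame with 3 channels: 64 + 16 + 4 = 84 tokens in total.
rng = np.random.default_rng(0)
frame = rng.normal(size=(8, 8, 3))
tokens = multi_scale_tokens(frame)
```

Fine-scale tokens keep local detail while coarse-scale tokens summarize larger regions, which is the local-versus-global trade-off the abstract describes.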
