Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis
Maged Shoman, Dongdong Wang, Armstrong Aboah, Mohamed Abdel-Aty

TL;DR
This paper presents a novel approach for dense video captioning in traffic safety videos, combining parallel decoding, CLIP-based visual features, domain adaptation, and knowledge transfer to improve event analysis accuracy.
Contribution
It introduces a parallel dense video captioning framework with CLIP features, domain adaptation, and knowledge transfer, advancing traffic safety video understanding.
Findings
Achieved 6th place in Track 2 of the AI City Challenge 2024.
Enhanced captioning accuracy through domain-specific adaptation.
Improved visual-language modeling with CLIP features.
Abstract
This paper introduces our solution for Track 2 of the AI City Challenge 2024. The task addresses traffic safety description and analysis using the Woven Traffic Safety (WTS) dataset, a real-world Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding. Our solution focuses on the following points: 1) To solve dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense captions, chapter by chapter, for each video. 2) We use CLIP to extract visual features so that cross-modality training between visual and textual representations is more efficient. 3) We conduct domain-specific model adaptation to mitigate the domain shift problem, which poses a recognition challenge in video understanding. 4) Moreover, we leverage BDD-5K captioned videos to conduct…
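The chapter-style output described in point 1 can be illustrated with a small post-processing sketch. This is not the authors' implementation and does not reproduce PDVC itself. It only assumes that a parallel decoder emits, for each event query, a normalized segment (center and length), a confidence score, and a caption. The names `EventPrediction` and `to_chapters`, the sample captions, and all numbers are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EventPrediction:
    """One event query's output (hypothetical structure, not from the paper)."""
    center: float      # normalized segment center in [0, 1]
    length: float      # normalized segment length in [0, 1]
    confidence: float  # score used to rank the query
    caption: str       # caption decoded for this segment


def to_chapters(predictions, video_duration, top_k):
    """Keep the top_k most confident predictions and return them as
    (start_sec, end_sec, caption) tuples ordered by start time."""
    ranked = sorted(predictions, key=lambda p: p.confidence, reverse=True)[:top_k]
    chapters = []
    for p in ranked:
        # Convert the normalized (center, length) pair to seconds,
        # clipping to the bounds of the video.
        start = max(0.0, (p.center - p.length / 2) * video_duration)
        end = min(video_duration, (p.center + p.length / 2) * video_duration)
        chapters.append((start, end, p.caption))
    return sorted(chapters, key=lambda c: c[0])


# Made-up predictions for a 20-second clip (illustrative values only).
predictions = [
    EventPrediction(0.75, 0.5, 0.9, "The pedestrian steps onto the road."),
    EventPrediction(0.25, 0.5, 0.8, "The vehicle approaches the crossing."),
    EventPrediction(0.5, 0.25, 0.1, "Low-confidence duplicate segment."),
]
chapters = to_chapters(predictions, video_duration=20.0, top_k=2)
for start, end, caption in chapters:
    print(f"{start:5.1f}s - {end:5.1f}s  {caption}")
```

In this sketch all queries are scored independently and the ranking happens once at the end, which is the sense in which the decoding is parallel. How the actual system selects the number of events and decodes captions is not stated in the text above.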
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
