Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis

Maged Shoman; Dongdong Wang; Armstrong Aboah; Mohamed Abdel-Aty

arXiv:2404.08229 · cs.CV · April 15, 2024 · 1 citation

TL;DR

This paper presents a novel approach for dense video captioning in traffic safety videos, combining parallel decoding, CLIP-based visual features, domain adaptation, and knowledge transfer to improve event analysis accuracy.

Contribution

It introduces a parallel dense video captioning framework with CLIP features, domain adaptation, and knowledge transfer, advancing traffic safety video understanding.

Findings

01 Achieved 6th place in AI City Challenge 2024.

02 Enhanced captioning accuracy through domain-specific adaptation.

03 Improved visual-language modeling with CLIP features.

Abstract

This paper introduces our solution for Track 2 of the AI City Challenge 2024. The task addresses traffic safety description and analysis using the Woven Traffic Safety (WTS) dataset, a real-world Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding. Our solution focuses on the following points: 1) To solve dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense captions by chapter for each video. 2) We use CLIP to extract visual features so that cross-modality training between visual and textual representations is more efficient. 3) We conduct domain-specific model adaptation to mitigate the domain shift problem, which poses a recognition challenge in video understanding. 4) Moreover, we leverage BDD-5K captioned videos to conduct…
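As a rough illustration of the dense captioning setup in point 1, the sketch below scores predicted event segments against annotated segments with temporal IoU and pairs them one-to-one. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not describe the matching step, the function names `temporal_iou` and `match_events` and the sample segments are hypothetical, and a brute-force search stands in for whatever assignment procedure a real PDVC-style model uses.

```python
from itertools import permutations


def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_events(predicted, ground_truth):
    """Pair each ground-truth segment with a distinct predicted segment.

    Hypothetical helper: tries every one-to-one assignment and keeps the
    one with the highest total IoU. Only practical for a handful of
    segments. Returns (pred_index, gt_index) pairs, or an empty list when
    there are fewer predictions than ground-truth segments.
    """
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(len(predicted)), len(ground_truth)):
        score = sum(
            temporal_iou(predicted[p], ground_truth[g])
            for g, p in enumerate(perm)
        )
        if score > best_score:
            best_score = score
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs


# Made-up segments for illustration only (seconds from video start).
predicted = [(0.0, 4.0), (10.0, 15.0), (3.5, 9.0)]
ground_truth = [(4.0, 9.0), (0.0, 3.0)]

pairs = match_events(predicted, ground_truth)
print(sorted(pairs))  # [(0, 1), (2, 0)]
```

Here prediction 2 is paired with annotation 0 and prediction 0 with annotation 1, while prediction 1 overlaps nothing and stays unmatched. In a set-prediction model, unmatched predictions of this kind are typically treated as "no event".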

Code & Models

Repositories

ucf-sst-lab/aicity2024cvprw
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training