Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving
Haoxiang Gao, Li Zhang, Yu Zhao, Zhou Yang, Jinghan Cao

TL;DR
This paper introduces a knowledge distillation approach from large vision-language models to improve pedestrian behavior prediction and scene understanding in autonomous driving, achieving enhanced perception and trajectory prediction.
Contribution
It presents a novel knowledge distillation method transferring knowledge from foundation models to efficient networks for better scene understanding in autonomous driving.
Findings
Improved open-vocabulary perception accuracy
Enhanced trajectory prediction performance
More diverse semantic attribute generation
Abstract
Vision-language models (VLMs) have become a promising approach to enhancing perception and decision-making in autonomous driving. A gap remains, however, in applying VLMs to complex scenarios that involve interaction with pedestrians, and in deploying them efficiently on vehicles. In this paper, we propose a knowledge distillation method that transfers knowledge from large-scale vision-language foundation models to efficient vision networks. We apply it to pedestrian behavior prediction and scene understanding tasks, achieving promising results in generating more diverse and comprehensive semantic attributes. We also use multiple pre-trained models and ensemble techniques to boost the model's performance. We further examine the effectiveness of the model after knowledge distillation; the results show significant metric improvements in open-vocabulary perception and trajectory prediction tasks, which…
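As a rough illustration of the idea in the abstract, the sketch below shows a generic soft-target distillation loss with an averaged teacher ensemble. It is not taken from the paper: the function names, the temperature parameter, and the choice of a KL-divergence objective over class logits are all assumptions made for illustration, and the authors' actual formulation may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_teacher_probs(teacher_logits_list, temperature=1.0):
    """Average the softened distributions of several (hypothetical) teachers."""
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    Assumption: a generic soft-target objective, not the paper's exact loss.
    """
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0  # terms with zero teacher mass contribute nothing
    )
    return kl * temperature ** 2

# Illustrative usage with made-up logits for three attribute classes.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
target = ensemble_teacher_probs(teachers, temperature=2.0)
loss = distillation_loss([0.2, 0.1, 0.0], target, temperature=2.0)
```

In practice the student would be a small vision network and this loss would be minimized with a gradient-based optimizer; the pure-Python version here only shows how the ensemble target and the loss are computed.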
Taxonomy
Methods: Knowledge Distillation
