Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving

Haoxiang Gao; Li Zhang; Yu Zhao; Zhou Yang; Jinghan Cao

arXiv:2501.06680 · cs.CV · July 31, 2025

TL;DR

This paper introduces a knowledge distillation approach from large vision-language models to improve pedestrian behavior prediction and scene understanding in autonomous driving, achieving enhanced perception and trajectory prediction.

Contribution

It presents a novel knowledge distillation method transferring knowledge from foundation models to efficient networks for better scene understanding in autonomous driving.

Findings

1. Improved open-vocabulary perception accuracy
2. Enhanced trajectory prediction performance
3. More diverse semantic attribute generation

Abstract

Vision-language models (VLMs) have become a promising approach to enhancing perception and decision-making in autonomous driving. A gap remains, however, in applying VLMs to complex scenarios that involve interaction with pedestrians, and in deploying them efficiently on vehicles. In this paper, we propose a knowledge distillation method that transfers knowledge from large-scale vision-language foundation models to efficient vision networks. We apply it to pedestrian behavior prediction and scene understanding tasks, where it achieves promising results in generating more diverse and comprehensive semantic attributes. We also use multiple pre-trained models and ensemble techniques to boost the model's performance. We further examine the effectiveness of the model after knowledge distillation; the results show significant metric improvements in open-vocabulary perception and trajectory prediction tasks, which…
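The abstract does not spell out the distillation objective, so the following is only a minimal sketch of one common formulation, not the authors' method. It assumes the teachers and the student each emit per-attribute logits, averages the temperature-softened distributions of several teachers (a simple stand-in for the ensemble step), and scores the student with the KL divergence from that averaged target. The function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher_probs(teacher_logits_list, temperature=1.0):
    """Average the softened distributions of several teachers (assumed ensemble rule)."""
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    count = len(probs)
    return [sum(p[i] for p in probs) / count for i in range(len(probs[0]))]


def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL(teacher || student) between softened distributions."""
    student_probs = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0) if a student probability underflows
    return sum(
        t * math.log(t / max(s, eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


# Hypothetical logits over three semantic attributes from two teachers.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
target = ensemble_teacher_probs(teachers, temperature=2.0)
loss = distillation_loss([0.0, 0.0, 3.0], target, temperature=2.0)
```

The loss is zero when the student's softened distribution matches the target and grows as they diverge. In practice such a term is often scaled by the squared temperature and combined with a supervised loss; whether the paper does either is not stated here.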

Taxonomy

Methods: Knowledge Distillation