Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference
Yunxiang Yang, Ningning Xu, Jidong J. Yang

TL;DR
This paper presents VISTA, a compact 3B-scale vision-language model for traffic scene understanding and risk inference. It is fine-tuned on pseudo-annotations that GPT-4o and o3-mini generate through structured Chain-of-Thought prompting, with the goal of keeping large-model performance while allowing efficient deployment.
Contribution
Introduces a novel structured prompting and knowledge distillation framework that enables training compact, high-performing traffic scene understanding models from large vision-language models.
Findings
VISTA achieves strong captioning metrics comparable to large models.
The framework enables real-time risk inference on edge devices.
Knowledge distillation maintains complex reasoning in a smaller model.
Abstract
Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision-Language Models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision…
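The teacher-to-student data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Chain-of-Thought steps, the split of roles between the two teachers, and every name here (`COT_STEPS`, `PseudoAnnotation`, `annotate_clip`, `to_sft_record`) are hypothetical, and the teachers are passed in as plain callables rather than real API clients.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical CoT stages; the paper's actual prompt structure is not given here.
COT_STEPS: List[str] = [
    "Describe the road layout, weather, and lighting.",
    "Describe vehicle behavior and traffic flow.",
    "List any potential hazards.",
]

# A teacher is modeled as any function from a text prompt to a text response.
Teacher = Callable[[str], str]


@dataclass
class PseudoAnnotation:
    """One teacher-generated training label for a video clip."""
    clip_id: str
    scene_description: str
    risk_assessment: str


def build_cot_prompt(steps: List[str]) -> str:
    """Join the reasoning steps into a single numbered prompt."""
    return "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, start=1))


def annotate_clip(clip_id: str, scene_teacher: Teacher, risk_teacher: Teacher) -> PseudoAnnotation:
    """Run both teachers in sequence (assumed order: scene first, then risk)."""
    scene = scene_teacher(build_cot_prompt(COT_STEPS))
    # The second teacher reasons over the first teacher's description.
    risk = risk_teacher(f"Scene description:\n{scene}\nAssess the traffic risk.")
    return PseudoAnnotation(clip_id, scene, risk)


def to_sft_record(ann: PseudoAnnotation) -> Dict[str, str]:
    """Package a pseudo-annotation as a supervised fine-tuning example."""
    return {
        "clip_id": ann.clip_id,
        "prompt": "Describe this traffic scene and assess its risk.",
        "target": f"{ann.scene_description}\n\nRisk: {ann.risk_assessment}",
    }
```

In such a setup, the records returned by `to_sft_record` would form the supervised fine-tuning set for the smaller student model, so the student sees only the teachers' final outputs and never calls them at inference time.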
