Enhancing Large Vision Model in Street Scene Semantic Understanding through Leveraging Posterior Optimization Trajectory

Wei-Bin Kou, Qingfeng Lin, Ming Tang, Jingreng Lei, Shuai Wang, Rongguang Ye, Guangxu Zhu, and Yik-Chung Wu

arXiv:2501.01710·cs.CV·June 3, 2025

Open Access

TL;DR

This paper introduces a method to enhance large vision models for street scene understanding in autonomous driving by leveraging posterior optimization trajectories, resulting in faster convergence and improved performance.

Contribution

It proposes a novel Posterior Optimization Trajectory (POT)-guided scheme and a POT Generator to accelerate training and improve generalization of large vision models in autonomous driving.

Findings

01. Performance improved by over 66.48%.

02. Convergence accelerated by more than 6 times.

03. Effectiveness demonstrated through extensive experiments.

Abstract

To improve the generalization of the autonomous driving (AD) perception model, vehicles need to update the model over time based on continuously collected data. As time progresses, the amount of data fitted by the AD model expands, which substantially improves its generalization. However, such ever-expanding data is a double-edged sword for the AD model. Specifically, once the fitted data volume grows beyond the AD model's fitting capacity, the model becomes prone to under-fitting. To address this issue, we propose to use a pretrained Large Vision Model (LVM) as the backbone, coupled with a downstream perception head, to understand AD semantic information. This design not only surmounts the aforementioned under-fitting problem thanks to LVMs' powerful fitting capabilities, but also enhances perception generalization thanks to LVMs' vast and diverse training data.…
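The backbone-plus-head composition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `FrozenBackbone` and `PerceptionHead`, the function `segment`, the toy dimensions, and the choice to keep the backbone weights fixed are all assumptions made for illustration, and the POT-guided training scheme itself is not shown because the truncated abstract does not describe it.

```python
import random


class FrozenBackbone:
    """Hypothetical stand-in for a pretrained LVM: fixed weights, never updated."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        # Random weights play the role of pretrained parameters in this toy example.
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                        for _ in range(feat_dim)]

    def __call__(self, pixel):
        # Per-pixel feature vector: ReLU(W @ pixel).
        return [max(0.0, sum(w * x for w, x in zip(row, pixel)))
                for row in self.weights]


class PerceptionHead:
    """Hypothetical downstream head: a linear layer giving per-pixel class scores."""

    def __init__(self, feat_dim, num_classes):
        # Zero initialization; in practice these weights would be trained.
        self.weights = [[0.0] * feat_dim for _ in range(num_classes)]

    def __call__(self, features):
        return [sum(w * f for w, f in zip(row, features)) for row in self.weights]


def segment(image, backbone, head):
    """Map an H x W grid of pixel vectors to an H x W grid of class indices."""
    label_map = []
    for row in image:
        label_row = []
        for pixel in row:
            scores = head(backbone(pixel))
            label_row.append(scores.index(max(scores)))  # argmax over classes
        label_map.append(label_row)
    return label_map


# Toy usage: a 2 x 3 "image" with 3 channels per pixel and 4 semantic classes.
image = [[[0.1, 0.5, 0.9] for _ in range(3)] for _ in range(2)]
labels = segment(image, FrozenBackbone(3, 8), PerceptionHead(8, 4))
```

The point of the split is that the large, pretrained component supplies the features while a much smaller task-specific component maps them to semantic classes; whether the paper freezes or fine-tunes the backbone is not stated in the text above.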

Taxonomy

Topics: Video Surveillance and Tracking Methods