BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles

Seyed Ahmad Hosseini Miangoleh; Amin Jalal Aghdasian; Farzaneh Abdollahi

arXiv:2510.22370·cs.RO·October 28, 2025

TL;DR

BLIP-FusePPO introduces a multimodal reinforcement learning framework that fuses vision-language embeddings with geometric and control data for improved lane-keeping in autonomous vehicles, enhancing robustness and generalization.

Contribution

The paper embeds semantic features from a vision-language model directly into the agent's state representation for autonomous lane keeping, which the authors report improves efficiency and robustness over existing approaches.

Findings

01 Outperforms baseline models in lane-keeping stability

02 Demonstrates better adaptability in complex driving scenarios

03 Enhances learning efficiency through a hybrid reward function

Abstract

In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning (RL) framework for autonomous lane-keeping (LK), in which semantic embeddings generated by a vision-language model (VLM) are directly fused with geometric states, LiDAR observations, and Proportional-Integral-Derivative-based (PID) control feedback within the agent observation space. The proposed method lets the agent learn driving rules that are aware of their surroundings and easy to understand by combining high-level scene understanding from the VLM with low-level control and spatial signals. Our architecture brings together semantic, geometric, and control-aware representations to make policy learning more robust. A hybrid reward function that includes semantic alignment, LK accuracy, obstacle…
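
The fusion described in the abstract, where a VLM embedding, geometric states, LiDAR observations, and PID feedback share one observation space, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the names `PID` and `fuse_observation`, the gains, the vector sizes, the choice of lateral offset and heading error as geometric states, and the use of plain concatenation as the fusion step. The PID controller is the textbook discrete form.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PID:
    """Textbook discrete PID controller (gains are illustrative, not from the paper)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float, dt: float) -> float:
        # Accumulate the integral term and take a finite-difference derivative.
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def fuse_observation(
    semantic: Sequence[float],   # VLM embedding (hypothetical, tiny for the demo)
    geometric: Sequence[float],  # assumed: lateral offset, heading error
    lidar: Sequence[float],      # range readings
    pid_feedback: float,         # steering suggestion from the PID controller
) -> List[float]:
    """Concatenate all modalities into one flat state vector (assumed fusion rule)."""
    return [*semantic, *geometric, *lidar, pid_feedback]


pid = PID(kp=0.5, ki=0.1, kd=0.05)
lateral_error = 0.2  # metres from lane centre, made-up value
steer_hint = pid.step(lateral_error, dt=0.1)
obs = fuse_observation(
    semantic=[0.1, -0.3, 0.7, 0.0],
    geometric=[lateral_error, 0.05],
    lidar=[12.0, 8.5, 15.2],
    pid_feedback=steer_hint,
)
print(len(obs), obs[-1])
```

In a real agent, the resulting vector would be what the PPO policy network receives at each step. A practical system would likely normalize each modality before concatenation, since embedding components and LiDAR ranges live on very different scales.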
