ScenePilot-4K: A Large-Scale First-Person Dataset and Benchmark for Vision-Language Models in Autonomous Driving

Yujin Wang; Yutong Zheng; Wenxian Fan; Tianyi Wang; Hongqing Chu; Li Zhang; Bingzhao Gao; Daxin Tian; Jianqiang Wang; Hong Chen

arXiv:2601.19582 · cs.CV · March 31, 2026

TL;DR

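ScenePilot-4K is a first-person autonomous driving dataset built from public online driving videos. It combines scene descriptions, risk labels, key-participant annotations, ego trajectories, and camera parameters, and it supports vision-language model evaluation across perception and planning tasks.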

Contribution

The paper introduces ScenePilot-4K, a large-scale dataset with multiple annotation types for autonomous driving vision-language models, together with the ScenePilot-Bench benchmark and a scalable multi-stage annotation pipeline.

Findings

01  Current models excel in scene semantics but struggle with geometry and planning reasoning.

02  The benchmark reveals significant domain shift challenges across regions and traffic conditions.

03  Baseline results highlight the need for improved geometry-aware perception in vision-language models.

Abstract

In this paper, we introduce ScenePilot-4K, a large-scale first-person dataset for safety-aware vision-language learning and evaluation in autonomous driving. Built from public online driving videos, ScenePilot-4K contains 3,847 hours of video and 27.7M front-view frames spanning 63 countries/regions and 1,210 cities. It jointly provides scene-level natural-language descriptions, risk assessment labels, key-participant annotations, ego trajectories, and camera parameters through a unified multi-stage annotation pipeline. Building on this dataset, we establish ScenePilot-Bench, a standardized benchmark that evaluates vision-language models along four complementary axes: scene understanding, spatial perception, motion planning, and GPT-based semantic alignment. The benchmark includes fine-grained metrics and geographic generalization settings that expose model robustness under cross-region…
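As a rough sanity check on the scale figures quoted in the abstract, the two totals can be combined into an implied average frame rate. The sketch below is plain arithmetic on the reported numbers. The variable and function names are illustrative, and the abstract does not state the actual sampling rate, so the result is only an average implied by the totals, not a documented property of the dataset.

```python
# Figures quoted in the abstract.
VIDEO_HOURS = 3_847   # total hours of video
FRAMES = 27.7e6       # "27.7M" front-view frames (a rounded figure)


def implied_fps(frames: float, hours: float) -> float:
    """Average frames per second implied by total frames and total duration."""
    return frames / (hours * 3600)


fps = implied_fps(FRAMES, VIDEO_HOURS)
print(f"Implied average rate: {fps:.2f} frames per second")
```

With these figures the average comes out at roughly 2 frames per second, which suggests the frames are a subsample of the source videos rather than every recorded frame.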

Code & Models

Repositories

yjwangtj/ScenePilot-4K
GitHub

Datasets

larswangtj/ScenePilot-4K
dataset · 178 downloads
