TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos

Korawat Charoenpitaks; Van-Quang Nguyen; Masanori Suganuma; Kentaro Arai; Seiji Totsuka; Hiroshi Ino; Takayuki Okatani

arXiv:2501.05733·cs.CV·January 13, 2025

TL;DR

This paper introduces TB-Bench, a benchmark for evaluating multi-modal large language models (MLLMs) on understanding traffic behaviors from dashcam images and videos, together with two instruction-tuning datasets, TB-100k and TB-250k. Baseline models fine-tuned on these datasets are far more accurate on the benchmark tasks than existing MLLMs.

Contribution

The paper presents TB-Bench, a benchmark covering eight perception tasks from ego-centric views, along with new datasets and baselines. It addresses the lack of traffic-specific evaluation tools for MLLMs in autonomous driving.

Findings

01. Existing MLLMs perform poorly on traffic tasks; even GPT-4o achieves less than 35% accuracy on average.

02. Fine-tuning on TB-100k or TB-250k raises the baseline models' average accuracy to as much as 85%.

03. Co-training on TB-100k improves performance on additional traffic datasets.

Abstract

The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform in these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing…
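The abstract reports accuracy "on average" across the benchmark's tasks but does not say how that average is computed. The sketch below assumes an unweighted mean of per-task accuracies with case-insensitive exact-match scoring. The record layout, the task names, and both function names are hypothetical and are not taken from the TB-Bench repository.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for (task, prediction, answer) records.

    Scoring is case-insensitive exact match, which is an assumption;
    the benchmark's actual scoring rule is not given in the abstract.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, answer in records:
        total[task] += 1
        if prediction.strip().lower() == answer.strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def average_accuracy(task_acc):
    """Unweighted mean over tasks (assumed; could also be sample-weighted)."""
    return sum(task_acc.values()) / len(task_acc)


# Toy records with placeholder task names, for illustration only.
records = [
    ("task_a", "left", "left"),
    ("task_a", "right", "left"),
    ("task_b", "Yes", "yes"),
    ("task_b", "no", "no"),
]

acc = per_task_accuracy(records)
avg = average_accuracy(acc)
print(acc, avg)
```

An unweighted mean gives every task equal influence regardless of how many questions it contains. A sample-weighted mean would instead favor the larger tasks, so the two can differ noticeably when task sizes are unbalanced.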

Code & Models

Repositories

tb-ad/tb-bench-110k-250k
Official

Taxonomy

Topics: Traffic Prediction and Management Techniques