TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
Han Gong, Zhen Zhou, Yunyang Shi, Yan Tan, Jinbiao Huo, Qi Hong, and Zhiyuan Liu

TL;DR
TRIP-Evaluate is a comprehensive multimodal benchmark designed to evaluate large models' capabilities across transportation tasks involving text, images, and point-cloud data, addressing existing gaps in specialized assessment tools.
Contribution
The paper introduces an open benchmark of 837 items, organized by a role-task-knowledge taxonomy and covering diverse transportation functions, that enables detailed diagnosis and comparison of multimodal large models.
Findings
Text performance is improving across models.
Weaknesses remain in engineering calculations and scene understanding.
The benchmark supports fine-grained diagnosis of failure modes.
Abstract
Large language models (LLMs) and multimodal large models (MLLMs) are increasingly used for transportation tasks such as regulation question answering, traffic management support, engineering review, and autonomous-driving scene reasoning. Yet transportation workflows are rule-intensive, computation-intensive, safety-critical, and inherently multimodal. Existing general benchmarks provide limited evidence of whether a model can apply regulations correctly, perform verifiable engineering calculations, or interpret traffic scenes reliably, while the few publicly available transportation benchmarks remain narrow in scope and rarely support fine-grained diagnosis across text, images, and point-cloud data. To address this gap, we present TRIP-Evaluate, an open multimodal benchmark for large models in transportation. The benchmark organizes 837 items using a role-task-knowledge taxonomy that…
