TL;DR
RailVQA introduces a new benchmark and framework for interpretable visual cognition in autonomous train operation, addressing safety-critical perception and reasoning challenges with efficient, generalizable models.
Contribution
The paper presents RailVQA-bench, a VQA benchmark for cab-view railway scenarios, and RailVQA-CoM, a collaborative framework that combines small and large models to improve efficiency and cognition.
Findings
Significant performance improvements on visual perception and reasoning tasks.
Enhanced interpretability and efficiency for autonomous train operation.
Better cross-domain generalization, demonstrated experimentally.
Abstract
As Automatic Train Operation (ATO) advances toward GoA4 and beyond, it increasingly depends on efficient, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual…
