TL;DR
SafePLUG is a framework that equips multimodal large language models (MLLMs) with pixel-level understanding and temporal grounding for fine-grained traffic accident analysis.
Contribution
The paper introduces a multimodal model with pixel-level understanding and temporal grounding capabilities, along with a new dataset for traffic accident understanding.
Findings
SafePLUG outperforms existing models on region-based question answering and segmentation tasks.
It localizes temporally anchored events and handles complex accident scenarios.
The framework advances fine-grained traffic scene analysis for safety applications.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance…
