RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu

TL;DR
RE-VLM is a novel dual-stream vision-language model that combines RGB images and event streams to improve scene understanding, especially under adverse conditions like low light or fast motion.
Contribution
It introduces parallel RGB and event encoders trained with a progressive strategy that aligns both visual streams with language, plus a graph-driven pipeline for scene graph generation and captioning that addresses the scarcity of RGB-Event-Text data.
Findings
RE-VLM outperforms state-of-the-art RGB-only and event-only models in captioning and VQA tasks.
The model shows significant improvements under challenging conditions such as low light.
The authors construct two datasets, PEOD-Chat and RGBE-Chat, for evaluation in diverse scenarios.
Abstract
Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event…
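The event modality described in the abstract can be illustrated with a small sketch. It accumulates per-pixel brightness-change events over a time window into a signed count image, one common way to hand asynchronous events to a frame-based encoder. This is not RE-VLM's actual preprocessing, which this summary does not specify; the `(x, y, timestamp, polarity)` tuple layout and the function name `accumulate_events` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Assumed event layout: (x, y, timestamp_us, polarity),
# where polarity is +1 (pixel got brighter) or -1 (pixel got darker).
Event = Tuple[int, int, int, int]


def accumulate_events(
    events: Iterable[Event],
    width: int,
    height: int,
    t_start: int,
    t_end: int,
) -> List[List[int]]:
    """Sum event polarities per pixel inside the window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for x, y, t, polarity in events:
        in_window = t_start <= t < t_end
        in_bounds = 0 <= x < width and 0 <= y < height
        if in_window and in_bounds:
            frame[y][x] += polarity
    return frame


# Hypothetical toy stream: two positive events at pixel (0, 0),
# one negative event at pixel (2, 1), and one event outside the window.
events = [(0, 0, 10, 1), (0, 0, 20, 1), (2, 1, 30, -1), (1, 1, 999, 1)]
frame = accumulate_events(events, width=3, height=2, t_start=0, t_end=100)
```

Pixels with no brightness change stay at zero, so a static scene yields an almost empty image. This is why events are presented as complementary to RGB frames rather than as a replacement for them.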
