LET-US: Long Event-Text Understanding of Scenes
Rui Chen, Xingyu Chen, Shaoan Wang, Shihan Kong, Junzhi Yu

TL;DR
LET-US is a framework for understanding long event streams from event cameras. It uses an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details, and a two-stage optimization paradigm to bridge the modality gap between event streams and text.
Contribution
The paper presents a new approach for long event-stream understanding, including a large-scale dataset, a hierarchical model, and a comprehensive benchmark, advancing cross-modal scene comprehension.
Findings
Outperforms prior models in accuracy and comprehension on long event streams
Effectively compresses event data while preserving critical details
Achieves state-of-the-art results across multiple tasks
Abstract
Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream–text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively…
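For readers unfamiliar with event data, the following minimal sketch shows what "reducing the volume of input events" can look like in the simplest case. It is not the LET-US compression mechanism, which the abstract does not specify. The event layout (x, y, timestamp in microseconds, polarity), the function names bin_events and merge_similar_windows, and the count-based merge rule are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical event layout: (x, y, timestamp_us, polarity).
Event = Tuple[int, int, int, int]


def bin_events(events: List[Event], window_us: int) -> List[int]:
    """Count events in fixed-width time windows, starting at the earliest timestamp."""
    if not events:
        return []
    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    counts = [0] * ((t1 - t0) // window_us + 1)
    for _, _, t, _ in events:
        counts[(t - t0) // window_us] += 1
    return counts


def merge_similar_windows(counts: List[int], tolerance: int) -> List[Tuple[int, int, int]]:
    """Greedily merge consecutive windows with similar activity.

    A window joins the current segment while its count differs from the
    segment's first window by at most `tolerance`. Returns a list of
    (start_window, end_window_exclusive, total_events) segments.
    """
    segments = []
    start, total = 0, 0
    for i, c in enumerate(counts):
        if i > start and abs(c - counts[start]) > tolerance:
            segments.append((start, i, total))
            start, total = i, 0
        total += c
    if counts:
        segments.append((start, len(counts), total))
    return segments


# Toy stream: a short burst, a quiet gap, then a denser burst.
events = [
    (0, 0, 0, 1), (1, 0, 500, 0), (2, 1, 1200, 1),
    (3, 1, 5000, 1), (3, 2, 5100, 0), (4, 2, 5200, 1), (4, 3, 5300, 1),
]
counts = bin_events(events, window_us=1000)
segments = merge_similar_windows(counts, tolerance=1)
print(counts)
print(segments)
```

In this toy stream, six 1 ms windows collapse into three segments, so the quiet gap costs one segment rather than three windows. A real system would likely compare richer features than raw event counts.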
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
