LET-US: Long Event-Text Understanding of Scenes

Rui Chen; Xingyu Chen; Shaoan Wang; Shihan Kong; Junzhi Yu

arXiv:2508.07401 · cs.CV · August 12, 2025

Open Access

TL;DR

LET-US introduces a novel framework for understanding long event streams from event cameras, employing adaptive compression and cross-modal techniques to improve interpretation and semantic comprehension over extended sequences.

Contribution

The paper presents a new approach for long event-stream understanding, including a large-scale dataset, a hierarchical model, and a comprehensive benchmark, advancing cross-modal scene comprehension.

Findings

01. Outperforms prior models in accuracy and comprehension on long event streams

02. Effectively compresses event data while preserving critical details

03. Achieves state-of-the-art results across multiple tasks

Abstract

Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream–text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively…
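To make the idea of reducing event volume concrete, here is a minimal sketch of one naive way to thin a long event stream: group events into fixed-width time bins and keep only the most active bins. This is not the adaptive compression mechanism of LET-US, which the abstract does not detail. The event layout `(timestamp_us, x, y, polarity)`, the function names `bin_events` and `compress`, and the parameters `bin_us` and `keep_ratio` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

# Assumed event layout: (timestamp in microseconds, x, y, polarity in {+1, -1}).
Event = Tuple[int, int, int, int]


def bin_events(events: List[Event], bin_us: int) -> Dict[int, List[Event]]:
    """Group events into fixed-width time bins, keyed by bin index."""
    bins: Dict[int, List[Event]] = {}
    for ev in events:
        bins.setdefault(ev[0] // bin_us, []).append(ev)
    return bins


def compress(events: List[Event], bin_us: int, keep_ratio: float) -> List[Event]:
    """Keep the events of the most active bins (by event count), in time order.

    Hypothetical stand-in for a compression step: activity is measured
    only by how many events fall into each bin.
    """
    bins = bin_events(events, bin_us)
    n_keep = math.ceil(len(bins) * keep_ratio)
    # Rank bin indices from most to least active, then keep the top n_keep.
    ranked = sorted(bins, key=lambda b: len(bins[b]), reverse=True)[:n_keep]
    # Re-emit the surviving bins in chronological order.
    return [ev for b in sorted(ranked) for ev in bins[b]]


if __name__ == "__main__":
    stream = [
        (0, 1, 1, 1), (10, 2, 2, -1), (20, 3, 3, 1),  # busy bin 0
        (1500, 4, 4, 1),                              # quiet bin 1
        (2100, 5, 5, -1), (2200, 5, 6, 1),            # bin 2
    ]
    print(compress(stream, bin_us=1000, keep_ratio=0.5))
```

With 1 ms bins and `keep_ratio=0.5`, the sketch keeps the two busiest of the three bins and drops the single event in the quiet middle bin. A count-based criterion like this is deliberately crude; it only illustrates the trade-off between input volume and retained detail that the abstract refers to.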

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition