Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving

Leila Cheshmi; Mennatullah Siam

arXiv:2508.14729 · cs.CV · August 21, 2025

TL;DR

This paper introduces a multiscale video transformer that efficiently performs class-agnostic segmentation in autonomous driving, accurately detecting unknown objects using motion cues without relying on optical flow.

Contribution

It proposes a novel end-to-end trainable video transformer with a memory-centric design and multiscale query decoding, improving efficiency and accuracy over existing methods.

Findings

1. Outperforms multiscale baselines on DAVIS'16, KITTI, and Cityscapes datasets.

2. Maintains high-resolution spatiotemporal features with shared memory.

3. Demonstrates real-time, robust dense prediction suitable for safety-critical robotics.

Abstract

Ensuring safety in autonomous driving is a complex challenge requiring handling unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves…
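The abstract names a scale-specific random drop-token but does not say how it works. The sketch below shows one plausible reading, in which each feature scale keeps a different random fraction of its tokens during training. It is an illustration under stated assumptions, not the authors' implementation: the function names, the scale labels, and the keep ratios in `KEEP_RATIO_PER_SCALE` are all hypothetical.

```python
import numpy as np

# Hypothetical keep ratios (not from the paper): finer scales hold more
# tokens, so this sketch drops a larger fraction of them.
KEEP_RATIO_PER_SCALE = {"1/8": 0.25, "1/16": 0.5, "1/32": 1.0}


def random_drop_tokens(tokens, keep_ratio, rng):
    """Keep a random subset of token rows; return the subset and its indices."""
    num_tokens = tokens.shape[0]
    num_keep = max(1, int(round(num_tokens * keep_ratio)))
    # Sample without replacement, then sort so the kept tokens stay in order.
    keep_idx = np.sort(rng.choice(num_tokens, size=num_keep, replace=False))
    return tokens[keep_idx], keep_idx


def drop_tokens_per_scale(features, keep_ratios, seed=0):
    """Apply random token dropping to each scale with its own keep ratio."""
    rng = np.random.default_rng(seed)
    return {
        scale: random_drop_tokens(tokens, keep_ratios[scale], rng)
        for scale, tokens in features.items()
    }


# Toy multiscale features: (num_tokens, channels) per scale.
features = {
    "1/8": np.ones((64, 8)),
    "1/16": np.ones((16, 8)),
    "1/32": np.ones((4, 8)),
}
dropped = drop_tokens_per_scale(features, KEEP_RATIO_PER_SCALE)
for scale, (kept, idx) in dropped.items():
    print(scale, kept.shape, idx[:4])
```

Returning the kept indices alongside the tokens lets a later stage map attention outputs back to their original spatiotemporal positions. Whether the paper's method does this is not stated in the abstract.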

Taxonomy

Topics: Medical Image Segmentation Techniques