Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications

Noreen Anwar, Guillaume-Alexandre Bilodeau, and Wassim Bouachir

arXiv:2508.04868·cs.CV·August 8, 2025

TL;DR

DAMM introduces a dual-stream attention framework with multi-modal queries to enhance object detection accuracy and efficiency in complex transportation scenes, addressing occlusion and localization challenges.

Contribution

The paper presents DAMM, a novel object detection framework that combines multi-modal queries and dual-stream attention for improved performance in cluttered environments.

Findings

01. Achieved state-of-the-art AP and recall on four benchmarks.

02. Demonstrated effectiveness of multi-modal query adaptation.

03. Improved localization in occluded and cluttered scenes.

Abstract

Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query…
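
To make the abstract's two ideas concrete, here is a minimal NumPy sketch of a query set built from three sources and a dual-stream cross-attention step that attends to semantic and spatial features separately. It is an illustration only, not the authors' implementation: the function names (`build_queries`, `dual_stream_attention`), the shared embedding size, the use of single-head dot-product attention, and the residual-sum fusion of the two streams are all assumptions, since the abstract does not specify them. The appearance and positional queries are random stand-ins for embeddings that would come from a vision-language model and from polygonal embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product cross-attention (assumed form).
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights

def build_queries(appearance, positional, num_learned, rng):
    # Hypothetical helper: concatenate the three query types named in the
    # abstract. The "learned" queries are random here; in training they
    # would be learnable parameters.
    d = appearance.shape[1]
    learned = rng.normal(size=(num_learned, d))
    return np.concatenate([appearance, positional, learned], axis=0)

def dual_stream_attention(queries, semantic_feats, spatial_feats):
    # Each stream attends to its own feature set with the same queries.
    sem_out, _ = cross_attention(queries, semantic_feats, semantic_feats)
    spa_out, _ = cross_attention(queries, spatial_feats, spatial_feats)
    # Fusion by residual sum is an assumption made for this sketch.
    return queries + sem_out + spa_out

rng = np.random.default_rng(0)
dim = 16
appearance = rng.normal(size=(3, dim))      # stand-in for VLM-derived queries
positional = rng.normal(size=(2, dim))      # stand-in for polygonal embeddings
queries = build_queries(appearance, positional, num_learned=4, rng=rng)

semantic_feats = rng.normal(size=(50, dim)) # stand-in image features, semantic stream
spatial_feats = rng.normal(size=(50, dim))  # stand-in image features, spatial stream
refined = dual_stream_attention(queries, semantic_feats, spatial_feats)
print(queries.shape, refined.shape)
```

Keeping the streams separate until the final fusion means each set of attention weights is normalized over its own features, so semantic and spatial evidence do not compete within a single softmax. Whether DAMM fuses the streams this way is not stated in the abstract.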
