Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications
Noreen Anwar, Guillaume-Alexandre Bilodeau, and Wassim Bouachir

TL;DR
DAMM introduces a dual-stream attention framework with multi-modal queries to enhance object detection accuracy and efficiency in complex transportation scenes, addressing occlusion and localization challenges.
Contribution
The paper presents DAMM, a novel object detection framework that combines multi-modal queries and dual-stream attention for improved performance in cluttered environments.
Findings
Achieved state-of-the-art AP and recall on four benchmarks.
Demonstrated effectiveness of multi-modal query adaptation.
Improved localization in occluded and cluttered scenes.
Abstract
Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query…
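The abstract names two mechanisms: a pool of queries drawn from three sources, and a cross-attention module with separate semantic and spatial streams. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The function names (`cross_attention`, `dual_stream_step`), the additive fusion of the two streams, the feature dimensions, and the use of random arrays in place of vision-language, polygonal, and learned embeddings are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    # Plain scaled dot-product cross-attention (single head, no projections).
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights


def dual_stream_step(queries, semantic_feats, spatial_feats):
    # Hypothetical fusion: each stream attends to its own feature map and
    # the two results are added to the queries as a residual update.
    sem_out, sem_w = cross_attention(queries, semantic_feats, semantic_feats)
    spa_out, spa_w = cross_attention(queries, spatial_feats, spatial_feats)
    return queries + sem_out + spa_out, sem_w, spa_w


rng = np.random.default_rng(0)
d = 16  # assumed embedding width

# Stand-ins for the three query types named in the abstract.
appearance_q = rng.normal(size=(4, d))  # would come from a vision-language model
positional_q = rng.normal(size=(4, d))  # would come from polygonal embeddings
learned_q = rng.normal(size=(8, d))     # randomly initialized, learned queries
queries = np.concatenate([appearance_q, positional_q, learned_q], axis=0)

# Stand-ins for flattened image features seen by each stream.
semantic_feats = rng.normal(size=(50, d))
spatial_feats = rng.normal(size=(50, d))

refined, sem_w, spa_w = dual_stream_step(queries, semantic_feats, spatial_feats)
print(refined.shape)  # (16, 16)
```

Keeping the two attention maps separate lets each stream weight image locations differently for the same query, which is one plausible reading of "separately refines semantic and spatial features"; how DAMM actually fuses the streams is not stated in the abstract.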
