IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception
Shaohong Wang, Lu Bin, Xinyu Xiao, Zhiyu Xiang, Hangguan Shan, Eryun Liu

TL;DR
This paper introduces IFTR, a novel instance-level fusion transformer for camera-only collaborative perception in autonomous driving. It improves how agents share and fuse visual features while keeping communication efficient, which yields significant performance gains.
Contribution
The paper proposes IFTR, an instance-level fusion transformer that fuses visual features from multiple agents for collaborative perception. It addresses the relative neglect of camera-based collaborative systems in prior work and keeps the communication between agents efficient.
Findings
Achieved up to a 57.96% improvement in AP@70 (average precision at an IoU threshold of 0.7) on the DAIR-V2X dataset.
Demonstrated superior performance over state-of-the-art methods on multiple datasets.
Validated the effectiveness of key components through extensive experiments.
Abstract
Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving in recent years. However, current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images. This severely impedes the development of budget-constrained collaborative systems and the exploitation of the advantages offered by the camera modality. This work proposes an instance-level fusion transformer for visual collaborative perception (IFTR), which enhances the detection performance of camera-only collaborative perception systems through the communication and sharing of visual features. To capture the visual information from multiple agents, we design an instance feature aggregation that interacts with the visual features of individual agents using predefined grid-shaped bird's-eye view…
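The abstract describes predefined grid-shaped BEV queries that interact with the visual features of individual agents. As a rough illustration only, this sketch assumes single-head dense scaled dot-product attention with a softmax (the generic mechanism listed under Methods), in which every BEV query attends over the feature tokens of all agents. The helper names `make_bev_grid_queries` and `aggregate_agent_features` are hypothetical; learned projections, positional encodings, and multi-head attention are omitted. This is not the IFTR module itself, whose details the truncated abstract does not give.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def make_bev_grid_queries(height: int, width: int, dim: int) -> List[Vector]:
    """Hypothetical grid-shaped BEV queries: one query vector per grid cell.

    Real queries would be learned embeddings; here each cell gets a
    deterministic vector derived from its (row, col) position, row-major.
    """
    queries = []
    for r in range(height):
        for c in range(width):
            queries.append(
                [math.sin((r + 1) * (k + 1)) + math.cos((c + 1) * (k + 1)) for k in range(dim)]
            )
    return queries


def aggregate_agent_features(queries: List[Vector], agent_features: List[List[Vector]]) -> List[Vector]:
    """Hypothetical aggregation: each BEV query attends over all agents' tokens.

    agent_features holds one list of token vectors per agent. The tokens of
    all agents are pooled, and each query receives a softmax-weighted sum of
    them. Returns one fused vector per query, in the same order as queries.
    """
    tokens = [tok for agent in agent_features for tok in agent]
    if not tokens:
        raise ValueError("at least one agent must contribute a feature token")
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)  # standard 1/sqrt(d) scaling of dot products
    fused = []
    for q in queries:
        weights = softmax([dot(q, t) * scale for t in tokens])
        fused.append([sum(w * t[k] for w, t in zip(weights, tokens)) for k in range(dim)])
    return fused


if __name__ == "__main__":
    queries = make_bev_grid_queries(2, 3, dim=4)  # a 2x3 BEV grid gives 6 queries
    ego = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # two tokens from the ego vehicle
    infra = [[0.0, 0.0, 1.0, 0.0]]  # one token from a roadside unit
    fused = aggregate_agent_features(queries, [ego, infra])
    print(len(fused), len(fused[0]))
```

Because the softmax weights for each query sum to 1, every fused vector is a convex combination of the agents' tokens. A practical system would more likely learn the queries and limit each query to spatially relevant image locations rather than attending densely to every token.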
Taxonomy
Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
Methods: Softmax · Attention Is All You Need
