GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space
Wentao Wang, Haoran Xu, Guang Tan

TL;DR
GT-Space introduces a scalable framework for heterogeneous collaborative perception in autonomous driving by mapping features into a ground-truth-based common space, simplifying data fusion and improving detection accuracy.
Contribution
It proposes a novel ground-truth feature space for scalable, unified feature alignment among heterogeneous agents, reducing complexity in multi-agent perception systems.
Findings
Outperforms baselines in detection accuracy on multiple datasets
Provides robust performance across diverse sensing modalities
Eliminates the need for pairwise encoder retraining
Abstract
In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose GT-Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore,…
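The scalability argument in the abstract can be illustrated with simple counting. The sketch below is not taken from the paper: the function names are hypothetical, and it assumes the pairwise baseline needs one directed interpreter for every ordered pair of agent types, whereas a shared space needs one adapter per agent type.

```python
def pairwise_interpreter_count(num_agent_types: int) -> int:
    """Interpreters needed if each agent type translates every other type's
    features (assumption: one directed interpreter per ordered pair)."""
    return num_agent_types * (num_agent_types - 1)


def shared_space_adapter_count(num_agent_types: int) -> int:
    """Adapters needed if each agent type projects into one common space."""
    return num_agent_types


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(
            f"{n} agent types: "
            f"{pairwise_interpreter_count(n)} pairwise interpreters vs "
            f"{shared_space_adapter_count(n)} shared-space adapters"
        )
```

Under these assumptions the pairwise count grows quadratically with the number of agent types, while the shared-space count grows linearly.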
Peer Reviews
Decision: ICLR 2026 Poster
1. The idea of aligning all heterogeneous features in the GT space to construct a common, unified feature space is well-motivated and novel. 2. The designed system is highly scalable and flexible. Because individual agents' encoders and detection heads are kept frozen, a new or unseen agent can be integrated simply by training a single, lightweight projector module to map its features to the GT-Space. 3. The framework's performance is not bottlenecked by the capability of the ego agent. By usin
1. I have a major concern about the scalability of fusion training. The fusion network is trained using a "combinatorial contrastive loss" across all pairs of modalities. The paper gives an example with 3 models, resulting in 3 pairs. This implies that for $M$ distinct modality types, the training complexity is $O(M^2)$. This is not scalable. 2. The visualization in Figure 5 is not very insightful. It simply shows that the fused feature map has stronger activations than the original. More compelling
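The reviewer's counting argument can be checked directly: the number of unordered pairs among $M$ modality types is $M(M-1)/2$, which equals 3 for $M = 3$ and grows as $O(M^2)$. The sketch below only enumerates pairs; it is not the paper's loss, and the modality names are hypothetical placeholders.

```python
from itertools import combinations
from math import comb


def modality_pairs(modalities):
    """Return every unordered pair of modality types."""
    return list(combinations(modalities, 2))


if __name__ == "__main__":
    # Hypothetical modality names, used only for illustration.
    pairs = modality_pairs(["lidar_a", "lidar_b", "camera"])
    print(len(pairs), pairs)

    # The pair count M * (M - 1) / 2 grows quadratically with M.
    for m in (3, 5, 10, 20):
        print(m, comb(m, 2))
```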
1. GT-Space requires only one projector per agent, with no pairwise adapters or encoder retraining, which makes it scalable. 2. GT-Space uses contrastive learning to handle the bottleneck effect in collaborative perception. 3. The experiments show that GT-Space achieves SOTA performance on widely used datasets.
This work requires accurate, dense 3D labels to build the common space, which is often impractical in real-world deployments. The reviewer suggests adding robustness experiments, such as how the pipeline handles communication latency and localization errors. This work evaluates only 3D detection; performance on other tasks, such as lane segmentation or tracking, is unclear.
1. Common Feature Space for Heterogeneous Agents: GT-Space introduces a common feature space for aligning heterogeneous agents, which promotes the practical deployment of heterogeneous collaborative perception systems. 2. Contrastive Learning for Consistent Representation: By employing contrastive learning to supervise the fusion network, GT-Space encourages different agents to learn consistent feature representations for the same instances, improving robustness. 3. Superior Performance Across Mo
1. Generalization to Unseen Agent Types: The paper trains and tests with the same set of agent types (e.g., agents A1-A4 are used for both training and testing under various combinations). However, in real-world scenarios, new, unseen agent types may need to be integrated into the system. Without ground truth labels for these new agents, how can the projection layer be trained to adapt to the fusion model for collaborative perception? What would be the estimated amount of training data and traini
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
