A Unified 3D Object Perception Framework for Real-Time Outside-In Multi-Camera Systems

Yizhou Wang; Sameer Pusegaonkar; Yuxing Wang; Anqi Li; Vishal Kumar; Chetan Sethi; Ganapathy Aiyer; Yun He; Kartikay Thakkar; Swapnil Rathi; Bhushan Rupde; Zheng Tang; Sujit Biswas

arXiv:2601.10819 · cs.CV · January 19, 2026

TL;DR

This paper introduces a real-time, large-scale 3D object perception framework for outside-in multi-camera systems, combining geometric priors, occlusion-aware ReID, and generative data augmentation to improve accuracy and efficiency.

Contribution

It presents an adapted Sparse4D framework with domain gap bridging, occlusion-aware ReID, and an optimized TensorRT plugin for real-time multi-camera 3D perception in infrastructure environments.

Findings

01

Achieved a state-of-the-art HOTA score of 45.22 on AI City Challenge 2025.

02

Developed a hardware-accelerated implementation with 2.15x speedup on modern GPUs.

03

Supported over 64 concurrent camera streams on a single GPU.
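
For readers unfamiliar with the HOTA metric cited in the first finding, the sketch below shows how a HOTA value at a single localization threshold combines detection and association accuracy as a geometric mean, following the standard published definition of the metric. It is a minimal illustration, not the AI City Challenge evaluation code: the function names, the input layout, and the example numbers are all hypothetical, and the similarity measure used to decide matches in the challenge is not specified here.

```python
import math


def hota_at_threshold(tp_assoc_scores, num_fn, num_fp):
    """HOTA at one localization threshold (hypothetical helper).

    tp_assoc_scores: one association score A(c) in [0, 1] per true-positive
        match c, where A(c) = |TPA| / (|TPA| + |FNA| + |FPA|).
    num_fn, num_fp: counts of missed and spurious detections.
    """
    num_tp = len(tp_assoc_scores)
    total = num_tp + num_fn + num_fp
    if total == 0 or num_tp == 0:
        return 0.0
    det_a = num_tp / total                  # detection accuracy (DetA)
    ass_a = sum(tp_assoc_scores) / num_tp   # association accuracy (AssA)
    return math.sqrt(det_a * ass_a)         # geometric mean of the two


def hota(per_threshold_inputs):
    """Average the per-threshold values; the standard metric averages
    over localization thresholds from 0.05 to 0.95."""
    values = [hota_at_threshold(*args) for args in per_threshold_inputs]
    return sum(values) / len(values)


# Made-up example: 2 matches with A(c) = 0.5 each, 1 miss, 1 false alarm.
# DetA = 2/4 = 0.5, AssA = 0.5, so the value is sqrt(0.25) = 0.5.
example = hota_at_threshold([0.5, 0.5], num_fn=1, num_fp=1)
```

Because the score is a geometric mean, a tracker cannot reach a high HOTA by excelling at detection alone or association alone; both factors must be high.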

Abstract

Accurate 3D object perception and multi-target multi-camera (MTMC) tracking are fundamental for the digital transformation of industrial infrastructure. However, transitioning "inside-out" autonomous driving models to "outside-in" static camera networks presents significant challenges due to heterogeneous camera placements and extreme occlusion. In this paper, we present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. Our system leverages absolute world-coordinate geometric priors and introduces an occlusion-aware ReID embedding module to maintain identity stability across distributed sensor networks. To bridge the Sim2Real domain gap without manual labeling, we employ a generative data augmentation strategy using the NVIDIA COSMOS framework, creating diverse environmental styles that enhance the model's appearance-invariance. Evaluated…
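
To make the idea of an absolute world-coordinate frame concrete, the following sketch projects a 3D world point into a static camera with a standard pinhole model. It is an illustrative assumption about the general geometry of outside-in setups, not the paper's implementation: the function name `project_world_point`, the parameter layout, and the example calibration values are all hypothetical.

```python
def project_world_point(point_w, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates (hypothetical helper).

    rotation: 3x3 world-to-camera rotation as nested lists.
    translation: length-3 world-to-camera translation.
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    Returns (u, v), or None if the point is not in front of the camera.
    """
    # World -> camera frame: X_cam = R * X_world + t
    cam = [
        sum(rotation[i][j] * point_w[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    x, y, z = cam
    if z <= 0:
        return None  # behind the image plane, not visible
    # Perspective division followed by the intrinsic mapping
    u = fx * x / z + cx
    v = fy * y / z + cy
    return (u, v)


# Made-up calibration: camera at the world origin looking along +Z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_world_point(
    [1.0, 0.0, 5.0], identity, [0.0, 0.0, 0.0],
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
# u = 1000 * 1/5 + 640 = 840, v = 360
```

In a static camera network the extrinsics are fixed per camera, so every camera can refer to the same world point in one shared frame; in the inside-out driving setting the sensors move with the vehicle.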

Taxonomy

Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Advanced Vision and Imaging