TL;DR
MonoIA is a semantic, language-grounded approach to monocular 3D object detection. It makes detection robust to variations in camera intrinsics and improves results on the KITTI, Waymo, and nuScenes benchmarks.
Contribution
The paper presents MonoIA, an intrinsic-aware framework that treats variation in camera intrinsics as a perceptual transformation rather than a numeric difference. It uses large language models and vision-language models to generate intrinsic embeddings, which improves cross-camera robustness.
Findings
Achieves state-of-the-art results on KITTI, Waymo, and nuScenes benchmarks.
Improves detection performance on KITTI by 1.18% and under multi-dataset training by 4.46%.
Demonstrates robustness to intrinsic variations through semantic intrinsic embeddings.
Abstract
Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsics govern how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision-language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the…
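To make the abstract's premise concrete (intrinsics govern how a 3D scene is projected onto the image plane), here is a minimal sketch of the standard pinhole projection. It is not code from the paper: the helper names `project_points` and `apparent_width`, the focal lengths, the principal point, and the 1.8 m object at 20 m depth are all illustrative assumptions.

```python
import numpy as np


def project_points(points_3d, fx, fy, cx, cy):
    """Project Nx3 camera-frame points (metres) to pixels with a pinhole model."""
    # Intrinsic matrix K built from focal lengths (fx, fy) and principal point (cx, cy).
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    uvw = points_3d @ K.T              # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # divide by depth to get (u, v)


def apparent_width(fx, width_m=1.8, depth_m=20.0, cx=640.0, cy=360.0):
    """Pixel width of a fronto-parallel object of given metric width and depth.

    All default values are hypothetical and chosen only for illustration.
    """
    edges = np.array([[-width_m / 2.0, 0.0, depth_m],
                      [width_m / 2.0, 0.0, depth_m]])
    uv = project_points(edges, fx, fx, cx, cy)
    return float(uv[1, 0] - uv[0, 0])


# Same object, same depth, two hypothetical cameras with different focal lengths.
w_a = apparent_width(fx=700.0)
w_b = apparent_width(fx=1400.0)
print(w_a, w_b)
```

Doubling the focal length doubles the object's width in pixels even though its 3D size and depth are unchanged. A detector that infers depth from apparent size will therefore misjudge it when the camera changes, which is the sensitivity to intrinsics that the abstract describes.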
