JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes
Haimei Zhao, Jing Zhang, Sen Zhang, Dacheng Tao

TL;DR
JPerceiver is a joint perception framework that simultaneously estimates depth, visual odometry, and bird's-eye-view scene layout from monocular videos, leveraging cross-view geometric transformation and attention mechanisms for improved accuracy and efficiency.
Contribution
JPerceiver is an end-to-end multi-task framework that unifies depth, VO, and BEV layout estimation. It uses a cross-view geometric transformation and a transfer module to address the scale ambiguity of monocular depth and VO.
Findings
Outperforms existing methods on the Argoverse, nuScenes, and KITTI datasets.
Achieves higher accuracy in depth, pose, and layout estimation.
Offers a more efficient model with reduced inference time.
Abstract
Depth estimation, visual odometry (VO), and bird's-eye-view (BEV) scene layout estimation present three critical tasks for driving scene perception, which is fundamental for motion planning and navigation in autonomous driving. Though they are complementary to each other, prior works usually focus on each individual task and rarely deal with all three tasks together. A naive way is to accomplish them independently in a sequential or parallel manner, but there are many drawbacks, i.e., 1) the depth and VO results suffer from the inherent scale ambiguity issue; 2) the BEV layout is directly predicted from the front-view image without using any depth-related information, although the depth map contains useful geometry clues for inferring scene layouts. In this paper, we address these issues by proposing a novel joint perception framework named JPerceiver, which can simultaneously estimate…
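The scale ambiguity named in drawback 1) can be illustrated with a small sketch. This is not JPerceiver's method; it only shows why a monocular pipeline cannot recover metric scale from image reprojection alone. Multiplying the depth and the camera translation by the same factor leaves the reprojected pixel unchanged. The intrinsics `K`, the pixel, the depth, and the motion below are made-up values chosen for illustration, and `reproject` is a hypothetical helper.

```python
import numpy as np

def reproject(pixel, depth, K, R, t):
    """Warp a pixel from one view into another.

    Back-project the pixel to 3D using its depth, apply the relative
    camera motion (R, t), and project back with the intrinsics K.
    """
    uv1 = np.array([pixel[0], pixel[1], 1.0])
    point = depth * (np.linalg.inv(K) @ uv1)  # 3D point in the first camera frame
    moved = R @ point + t                     # same point in the second camera frame
    proj = K @ moved
    return proj[:2] / proj[2]                 # perspective divide

# Hypothetical pinhole intrinsics (illustrative values only).
K = np.array([[720.0, 0.0, 620.0],
              [0.0, 720.0, 190.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                    # no rotation, to keep the example simple
t = np.array([0.1, 0.0, -1.0])   # relative translation in arbitrary units
pixel = (700.0, 250.0)
depth = 12.0
scale = 3.7                      # any positive factor works

warped = reproject(pixel, depth, K, R, t)
warped_scaled = reproject(pixel, scale * depth, K, R, scale * t)
warped_depth_only = reproject(pixel, scale * depth, K, R, t)

# Jointly scaled depth and translation give the same pixel.
print(np.allclose(warped, warped_scaled))
# Scaling depth alone does change the reprojection.
print(np.allclose(warped, warped_depth_only))
```

Because the two jointly scaled solutions are indistinguishable in the image, an extra source of metric information is needed to fix the scale. The summary above states that JPerceiver addresses this through its cross-view geometric transformation.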
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Byte Pair Encoding · Adam · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer
