Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry
Yunus Bilge Kurt, Ahmet Akman, A. Aydın Alatan

TL;DR
This paper introduces VIFT, a causal transformer-based approach for visual-inertial odometry that enhances pose estimation accuracy by leveraging attention mechanisms and specialized training techniques, achieving state-of-the-art results on KITTI.
Contribution
The paper proposes a novel causal transformer architecture for deep visual-inertial odometry that refines pose estimates using latent feature vectors and addresses data imbalance and rotation learning.
Findings
VIFT outperforms previous methods on the KITTI dataset.
The transformer-based approach improves pose estimation accuracy.
The model is end-to-end trainable and needs only a monocular camera and an IMU.
Abstract
In recent years, transformer-based architectures have become the de facto standard for sequence modeling in deep learning frameworks. Inspired by these successful examples, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which utilize historical data better than the recurrent neural network (RNN) based approaches used in recent methods. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating latent vectors temporally. Our study also examines the impact of data imbalance and rotation…
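The causal refinement of latent vectors described in the abstract can be illustrated with a minimal sketch of masked self-attention. This is not the authors' implementation: it assumes a single attention head with identity query/key/value projections (a real model would learn these), and the names `softmax` and `causal_attention` as well as the toy latent vectors are hypothetical. It only shows how each time step attends to itself and to earlier steps, never to later ones.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(latents):
    """Single-head causal self-attention over a sequence of latent vectors.

    Assumption: identity projections stand in for learned Q/K/V weights.
    Returns the refined latents and the full attention weight matrix.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    refined, all_weights = [], []
    for t, query in enumerate(latents):
        # Causal mask: step t scores only steps 0..t (the past and itself).
        scores = [
            scale * sum(q * k for q, k in zip(query, latents[s]))
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Refined latent is the attention-weighted sum of visible latents.
        out = [
            sum(w * latents[s][d] for s, w in enumerate(weights))
            for d in range(dim)
        ]
        refined.append(out)
        # Pad with zeros so masked (future) positions are explicit.
        all_weights.append(weights + [0.0] * (len(latents) - t - 1))
    return refined, all_weights


# Toy sequence of three 2-D latent vectors (hypothetical values).
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined, weights = causal_attention(latents)
```

Because future positions receive zero weight, changing a later latent vector leaves the refined vectors of earlier steps unchanged, which is the property that distinguishes a causal transformer from a bidirectional one.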
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Image and Object Detection Techniques
Methods: Softmax · Attention Is All You Need
