Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry

Yunus Bilge Kurt; Ahmet Akman; A. Aydın Alatan

arXiv:2409.08769·cs.CV·September 16, 2024

TL;DR

This paper introduces VIFT, a causal transformer-based approach for visual-inertial odometry that enhances pose estimation accuracy by leveraging attention mechanisms and specialized training techniques, achieving state-of-the-art results on KITTI.

Contribution

The paper proposes a novel causal transformer architecture for deep visual-inertial odometry that refines pose estimates using latent feature vectors and addresses data imbalance and rotation learning.

Findings

01. VIFT outperforms previous methods on the KITTI dataset.

02. The transformer-based approach improves pose estimation accuracy.

03. The model is end-to-end trainable with only a monocular camera and an IMU.

Abstract

In recent years, transformer-based architectures have become the de facto standard for sequence modeling in deep learning frameworks. Inspired by these successes, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which make better use of historical data than the recurrent neural network (RNN) based approaches used in recent work. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating latent vectors temporally. Our study also examines the impact of data imbalance and rotation…
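To make the word "causal" in the abstract concrete, here is a minimal sketch of single-head scaled dot-product attention with a causal mask over a sequence of latent vectors. It is an illustration only, not the authors' implementation: the function name `causal_attention`, the use of the latent vectors directly as queries, keys, and values, and the toy inputs are all assumptions, and learned projections, multiple heads, and the pose regression head are omitted.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of T vectors (lists of floats).
    The output at step t is a weighted average of values[0..t] only,
    so no information from future time steps can leak backwards.
    (Hypothetical helper for illustration; not taken from the paper's code.)
    """
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Causal mask: score only the steps s <= t.
        scores = [
            sum(a * b for a, b in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible (past and present) steps.
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy sequence of three 2-D "latent" vectors (made-up numbers).
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = causal_attention(latents, latents, latents)
```

Because each step attends only to itself and earlier steps, changing the last latent vector leaves the first two outputs untouched. That property is what allows a model of this kind to run online, with each pose estimate depending only on measurements already received.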

Code & Models

Repositories

ybkurt/vift (PyTorch, official)

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Image and Object Detection Techniques

Methods: Softmax · Attention Is All You Need