FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow

Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Yijin Li, Hongwei Qin, Jifeng Dai, Xiaogang Wang, and Hongsheng Li

arXiv:2306.05442 · cs.CV · June 12, 2023 · 1 citation

TL;DR

FlowFormer introduces a transformer-based architecture with Masked Cost Volume AutoEncoding for improved optical flow estimation, achieving state-of-the-art results and better generalization on benchmark datasets.

Contribution

The paper presents a novel transformer architecture for optical flow and a pretraining scheme that leverages unlabeled data to enhance performance.

Findings

01. FlowFormer achieves 1.16 and 2.09 AEPE on the Sintel clean and final passes, a 16.5% and 15.5% error reduction from GMA (1.388 and 2.47).

02. MCVA pretraining further improves FlowFormer's accuracy, reducing errors by over 7%.

03. FlowFormer+MCVA ranks 1st on the Sintel and KITTI benchmarks.
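
The findings above are reported in average end-point error (AEPE). As a minimal sketch of that metric, the snippet below averages the Euclidean distance between predicted and ground-truth flow vectors over all pixels. The function name `aepe` and the toy flow values are illustrative assumptions, not taken from the paper or its code.

```python
import math


def aepe(pred, gt):
    """Average end-point error between two flow fields.

    pred, gt: equal-length lists of (u, v) flow vectors, one per pixel.
    (Hypothetical helper for illustration only.)
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("flow fields must be non-empty and the same size")
    # End-point error of one pixel is the L2 norm of the flow difference.
    total = sum(
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred, gt)
    )
    return total / len(pred)


# Toy example: three pixels with per-pixel errors of 0, 2, and 5.
pred = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
gt = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(aepe(pred, gt))
```

Lower is better: a perfect prediction gives an AEPE of 0, and the value is measured in pixels.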

Abstract

This paper introduces a novel transformer-based network architecture, FlowFormer, along with Masked Cost Volume AutoEncoding (MCVA) for pretraining it to tackle the problem of optical flow estimation. FlowFormer tokenizes the 4D cost volume built from the source-target image pair and iteratively refines the flow estimate with a cost-volume encoder-decoder architecture. The cost-volume encoder derives a cost memory with alternate-group transformer (AGT) layers in a latent space, and the decoder recurrently decodes flow from the cost memory with dynamic positional cost queries. On the Sintel benchmark, the FlowFormer architecture achieves 1.16 and 2.09 average end-point error (AEPE) on the clean and final passes, a 16.5% and 15.5% error reduction from GMA (1.388 and 2.47). MCVA enhances FlowFormer by pretraining the cost-volume encoder with a masked autoencoding scheme, which further…
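
To make the "4D cost volume" in the abstract concrete, here is a rough sketch that assumes the volume is formed from dot products between per-pixel feature vectors of the two images, as is common in cost-volume-based flow methods. The abstract does not spell out this construction, so the similarity measure, the scaling by the square root of the feature dimension, the function names, and the random features are all assumptions. The second helper only regroups the volume into one cost map per source pixel; the actual tokenization, AGT encoder, and decoder are not shown.

```python
import numpy as np


def build_cost_volume(feat_src, feat_tgt):
    """Dot-product cost volume (assumed construction, for illustration).

    feat_src, feat_tgt: (H, W, D) feature maps of the source and target images.
    Returns an (H, W, H, W) array with
    cost[i, j, k, l] = <feat_src[i, j], feat_tgt[k, l]> / sqrt(D).
    """
    d = feat_src.shape[-1]
    return np.einsum("ijd,kld->ijkl", feat_src, feat_tgt) / np.sqrt(d)


def cost_maps(cost):
    """View the 4D volume as one (H, W) cost map per source pixel."""
    h, w = cost.shape[:2]
    return cost.reshape(h * w, h, w)


# Random stand-in features; a real model would use a learned image encoder.
rng = np.random.default_rng(0)
H, W, D = 4, 6, 8
feat_src = rng.standard_normal((H, W, D))
feat_tgt = rng.standard_normal((H, W, D))

cost = build_cost_volume(feat_src, feat_tgt)
maps = cost_maps(cost)
print(cost.shape, maps.shape)  # (4, 6, 4, 6) (24, 4, 6)
```

The volume grows with the square of the number of pixels, which is one reason to compress it into a smaller representation before further processing.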

Taxonomy

Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Advanced Image and Video Retrieval Techniques