FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow
Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Yijin Li, Hongwei Qin, Jifeng Dai, Xiaogang Wang, and Hongsheng Li

TL;DR
FlowFormer introduces a transformer-based architecture with Masked Cost Volume AutoEncoding for improved optical flow estimation, achieving state-of-the-art results and better generalization on benchmark datasets.
Contribution
The paper presents a novel transformer architecture for optical flow and a pretraining scheme that leverages unlabeled data to enhance performance.
Findings
FlowFormer achieves 1.16 and 2.09 AEPE on the Sintel clean and final passes, a 16.5% and 15.5% error reduction from GMA (1.388 and 2.47).
MCVA pretraining further improves FlowFormer's accuracy, reducing errors by over 7%.
FlowFormer+MCVA ranks 1st on Sintel and KITTI benchmarks.
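The AEPE figures above can be made concrete with a short sketch. It assumes the usual definition of average end-point error (the mean Euclidean distance between predicted and ground-truth flow vectors) and uses toy arrays rather than Sintel data; the helper names `aepe` and `relative_reduction` are illustrative and do not come from the paper.

```python
import numpy as np

def aepe(flow_pred, flow_gt):
    """Average end-point error for flow fields of shape (H, W, 2)."""
    # Per-pixel Euclidean distance between the two flow vectors, then the mean.
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))

def relative_reduction(baseline, new):
    """Fractional error reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

# Toy example: every predicted vector is off by (3, 4), so each error is 5.
flow_gt = np.zeros((2, 2, 2))
flow_pred = np.zeros((2, 2, 2))
flow_pred[..., 0] = 3.0
flow_pred[..., 1] = 4.0
toy_aepe = aepe(flow_pred, flow_gt)

# Reductions recomputed from the rounded figures quoted for GMA and FlowFormer.
clean_reduction = relative_reduction(1.388, 1.16)
final_reduction = relative_reduction(2.47, 2.09)
print(f"toy AEPE: {toy_aepe:.2f}")
print(f"clean: {clean_reduction:.1%}, final: {final_reduction:.1%}")
```

Recomputing from the rounded AEPE values gives roughly 16.4% and 15.4%, marginally below the quoted 16.5% and 15.5%; the gap is presumably a rounding effect in the reported numbers.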
Abstract
This paper introduces a novel transformer-based network architecture, FlowFormer, along with Masked Cost Volume AutoEncoding (MCVA) for pretraining it to tackle the problem of optical flow estimation. FlowFormer tokenizes the 4D cost volume built from the source-target image pair and iteratively refines flow estimation with a cost-volume encoder-decoder architecture. The cost-volume encoder derives a cost memory with alternate-group transformer (AGT) layers in a latent space, and the decoder recurrently decodes flow from the cost memory with dynamic positional cost queries. On the Sintel benchmark, the FlowFormer architecture achieves 1.16 and 2.09 average end-point error (AEPE) on the clean and final passes, a 16.5% and 15.5% error reduction from GMA (1.388 and 2.47). MCVA enhances FlowFormer by pretraining the cost-volume encoder with a masked autoencoding scheme, which further…
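To illustrate what a 4D cost volume is, here is a minimal sketch under stated assumptions: similarity is taken as a scaled dot product between per-pixel feature vectors (a common choice, not confirmed by the abstract), and random arrays stand in for the output of a feature encoder. The function name `build_cost_volume` and the sizes `H`, `W`, `D` are hypothetical.

```python
import numpy as np

def build_cost_volume(feat_src, feat_tgt):
    """Correlate every source pixel with every target pixel.

    feat_src, feat_tgt: feature maps of shape (H, W, D).
    Returns a 4D array of shape (H, W, H, W).
    """
    d = feat_src.shape[-1]
    # Assumption: dot-product similarity scaled by sqrt(D).
    return np.einsum("ijd,kld->ijkl", feat_src, feat_tgt) / np.sqrt(d)

# Random features stand in for a real image encoder (assumption).
rng = np.random.default_rng(0)
H, W, D = 4, 5, 8
feat_src = rng.standard_normal((H, W, D))
feat_tgt = rng.standard_normal((H, W, D))

cost = build_cost_volume(feat_src, feat_tgt)

# One H x W cost map per source pixel; a model could tokenize these maps.
cost_maps = cost.reshape(H * W, H, W)
print(cost.shape, cost_maps.shape)
```

The volume grows with the square of the pixel count, which is one motivation for encoding it into a compact latent representation before decoding flow.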
Taxonomy
Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Advanced Image and Video Retrieval Techniques
