WAFT: Warping-Alone Field Transforms for Optical Flow
Yihan Wang, Jia Deng

TL;DR
WAFT introduces a simple, high-resolution warping-based approach for optical flow that outperforms traditional cost volume methods in accuracy, speed, and generalization, challenging established design principles.
Contribution
WAFT replaces cost volume with high-resolution warping, achieving superior accuracy and efficiency in optical flow estimation with minimal inductive biases.
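The memory argument can be made concrete with a rough count. The sketch below assumes an all-pairs cost volume (one score per pair of pixels, as in RAFT) and compares it with a single warped feature map. The function names, the 1080x1920 frame size, and the 128-channel feature width are illustrative assumptions, not values taken from the paper.

```python
def cost_volume_entries(height, width, stride):
    """Entries in an all-pairs cost volume built at 1/stride resolution.

    Every pixel of frame 1 is scored against every pixel of frame 2,
    so the count grows with the square of the number of pixels.
    """
    n = (height // stride) * (width // stride)
    return n * n


def warped_feature_entries(height, width, stride, channels):
    """Entries in one warped feature map at 1/stride resolution.

    Warping stores a single feature vector per pixel, so the count
    grows linearly with the number of pixels.
    """
    n = (height // stride) * (width // stride)
    return n * channels


if __name__ == "__main__":
    h, w, c = 1080, 1920, 128  # assumed frame size and feature width
    for stride in (8, 2, 1):
        cv = cost_volume_entries(h, w, stride)
        wf = warped_feature_entries(h, w, stride, c)
        print(f"stride {stride}: cost volume {cv:,} vs warped features {wf:,}")
```

Under these assumptions the all-pairs volume grows quadratically in pixel count while the warped features grow linearly, which is why moving to higher resolution is far cheaper without a cost volume.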
Findings
Ranks 1st on Spring, Sintel, and KITTI benchmarks.
Achieves best zero-shot generalization on KITTI.
1.3-4.1x faster than existing methods with competitive accuracy (e.g., 1.3x faster than Flowformer++, 4.1x faster than CCMR+).
Abstract
We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces the cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and minimal reliance on custom designs. Compared with existing methods, WAFT ranks 1st on the Spring, Sintel, and KITTI benchmarks and achieves the best zero-shot generalization on KITTI, while being 1.3-4.1x faster than existing methods that have competitive accuracy (e.g., 1.3x faster than Flowformer++, 4.1x faster than CCMR+). Code and model weights are available at https://github.com/princeton-vl/WAFT.
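The warping operation that stands in for the cost volume can be sketched as follows. This is a minimal, single-channel illustration of backward warping with bilinear interpolation, assuming border clamping for out-of-range samples. The names `bilinear_sample` and `warp` are hypothetical and do not come from the WAFT repository; a practical implementation would operate on batched, multi-channel tensors.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample the 2D grid feat[row][col] at real-valued (x, y).

    Coordinates outside the grid are clamped to the border.
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets inside the cell
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(feat2, flow):
    """Backward-warp frame-2 features into frame 1.

    flow[y][x] = (u, v) is the current flow estimate at pixel (x, y) of
    frame 1; the output at (x, y) is feat2 sampled at (x + u, y + v).
    """
    h, w = len(feat2), len(feat2[0])
    return [
        [bilinear_sample(feat2, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


if __name__ == "__main__":
    feat2 = [[0, 1, 2, 3],
             [4, 5, 6, 7]]
    # A uniform flow of one pixel to the right.
    shift_right = [[(1.0, 0.0)] * 4 for _ in range(2)]
    print(warp(feat2, shift_right))
```

In a RAFT-style loop, the warped features would be recomputed from the current flow estimate at each iteration and passed to the update module together with the frame-1 features, so no pairwise score table is ever stored.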
Peer Reviews
Decision·ICLR 2026 Oral
1. Removing the cost volume is a good contribution, which may make estimation of optical flow on high-res images much more feasible.
1. No detailed evaluation of how the model performs on large displacements, on which WAFT might be slightly weaker than models using a cost volume. The authors could make artificial displacements to stress-test WAFT to see where its limit lies. 2. No details are given for the Recurrent Update Module. For example, how many layers (esp. self attention layers), what's the total param count?
1. This paper is well written. 2. This paper has a clear and extensive ablation study to show the effectiveness of each design choice. 3. The authors introduced an attention-based updater to replace the cost volume for feature similarity computation. This design is reasonable and novel.
1. Since the authors replace the commonly used CNN updater with an attention-based one, it would be better to provide more details of its layers. 2. For the models used in Table 2, what is the downsampling ratio? And which line corresponds to the statement in the abstract "while being up to 4.1× faster than methods with similar performance"? From my understanding, WAFT-Twins-a2 uses the same feature encoder as FlowFormer++ and achieves similar performance but no significant speedup? 3. Can the authors prov
1. The design of WAFT without cost volume computation is very simple, flexible, and effective, making it a significant contribution to the computer vision research community. 2. By avoiding cost volume computation, WAFT can perform warping on original-resolution feature maps, which can help achieve sharper boundary predictions in optical flow estimation. 3. WAFT has shown the best zero-shot cross-dataset generalization on KITTI, which is an important property towards generalization capability on unseen
1. The iterative recurrent update module may restrict the algorithm's potential for parallel optimization to achieve low latency. 2. WAFT relies on existing pre-trained vision foundation models, which may limit its potential for further computational efficiency improvements in feature extraction. 3. Compared with the gains in memory and computational efficiency, the improvement in flow accuracy is relatively limited.
Taxonomy
Topics: Image Processing Techniques and Applications · Industrial Vision Systems and Defect Detection · Image and Signal Denoising Methods
