A Study of Finetuning Video Transformers for Multi-view Geometry Tasks

Huimin Wu; Kwang-Ting Cheng; Stephen Lin; Zhirong Wu

arXiv:2512.18684·cs.CV·December 23, 2025

Open Access

TL;DR

This paper demonstrates that general-purpose video foundation models can be effectively fine-tuned for multi-view geometry tasks like optical flow, achieving state-of-the-art results with minimal architectural modifications.

Contribution

It shows that pretrained video transformers can be adapted to geometric tasks using simple linear decoders and iterative refinement, without task-specific pretraining or complex architectures.
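To make the "simple linear decoder" idea concrete, here is a minimal sketch of how one linear layer could turn per-patch tokens into a dense two-channel flow field. It is an illustration under stated assumptions, not the authors' implementation: the function name `linear_decode_flow`, the row-major patch order, and the choice to emit `patch * patch * 2` values per token are all hypothetical, and the listing above does not specify the actual decoder layout.

```python
def linear_decode_flow(tokens, weight, bias, grid_h, grid_w, patch):
    """Map patch tokens to a dense flow field with a single linear layer.

    Assumed (hypothetical) layout:
      tokens: grid_h * grid_w token vectors of length D, row-major patch order
      weight: patch * patch * 2 rows, each of length D
      bias:   patch * patch * 2 values
    Returns flow[y][x] = (u, v) for an image of size
    (grid_h * patch) x (grid_w * patch).
    """
    height, width = grid_h * patch, grid_w * patch
    flow = [[(0.0, 0.0)] * width for _ in range(height)]
    for idx, tok in enumerate(tokens):
        gy, gx = divmod(idx, grid_w)  # patch position on the token grid
        # Linear projection: one output per (pixel in patch, flow channel).
        out = [sum(w * t for w, t in zip(row, tok)) + b
               for row, b in zip(weight, bias)]
        # Scatter the projected values back to pixel positions.
        for py in range(patch):
            for px in range(patch):
                k = (py * patch + px) * 2
                flow[gy * patch + py][gx * patch + px] = (out[k], out[k + 1])
    return flow


# Toy usage: a 2x2 token grid, 3-dim tokens, 2x2-pixel patches.
tokens = [[1.0, 0.0, 0.0]] * 4
weight = [[0.5, 0.0, 0.0]] * 8   # every output reads the first feature
bias = [0.0] * 8
flow = linear_decode_flow(tokens, weight, bias, grid_h=2, grid_w=2, patch=2)
print(len(flow), len(flow[0]), flow[0][0])
```

In practice the token vectors would come from the pretrained video transformer backbone and the weights would be learned during fine-tuning; the sketch only shows the shape bookkeeping of a per-patch linear readout.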

Findings

01 Achieved top cross-dataset generalization for optical flow.

02 Set new records on online benchmarks for optical flow.

03 Strong performance in 3D depth estimation and stereo matching.

Abstract

This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to state-of-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on…
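For readers unfamiliar with the metric, end-point error (EPE) is the Euclidean distance between a predicted and a ground-truth flow vector, averaged over pixels. The sketch below computes it in plain Python; the function name `end_point_error` and the list-of-tuples input format are assumptions made for illustration, and benchmark implementations typically also mask out invalid pixels, which is omitted here.

```python
import math


def end_point_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    pred, gt: equal-length sequences of (u, v) flow vectors, one per pixel.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    total = sum(math.hypot(pu - gu, pv - gv)
                for (pu, pv), (gu, gv) in zip(pred, gt))
    return total / len(pred)


# Two pixels: one is off by the vector (3, 4), the other is exact.
pred = [(3.0, 4.0), (0.0, 0.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
print(end_point_error(pred, gt))  # 2.5
```

Lower is better: an EPE of 0.69 means the predicted flow vectors are, on average, 0.69 pixels away from the ground truth.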

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques