DAOVI: Distortion-Aware Omnidirectional Video Inpainting

Ryosuke Seshimo; Mariko Isogawa

arXiv:2509.00396·cs.CV·September 3, 2025

TL;DR

This paper introduces DAOVI, a deep learning model specifically designed for inpainting omnidirectional videos, effectively handling geometric distortions and preserving spatial-temporal consistency.

Contribution

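DAOVI is a deep learning model for omnidirectional video inpainting that explicitly accounts for distortion in equirectangular projections. It combines an image-space propagation module that checks optical flow validity using geodesic distance (GFCIP) with a depth-assisted, distortion-guided feature propagation module (ODAFP).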

Findings

1. DAOVI outperforms existing inpainting methods quantitatively.
2. DAOVI produces more visually coherent inpainted omnidirectional videos.
3. The model effectively handles geometric distortions in equirectangular projections.

Tables

Table 1: Quantitative comparison on the ODV360 [Cao et al.(2023)] dataset. The best result of each metric is marked in bold font.

Method	PSNR [dB] (↑)	SSIM (↑)	WS-PSNR [dB] (↑)	WS-SSIM (↑)	VFID (↓)
FuseFormer	29.18	0.9727	28.87	0.9392	0.320
STTN	32.21	0.9815	31.09	0.9520	0.238
ProPainter	32.81	0.9877	32.86	0.9648	0.149
Ours	33.37	0.9888	33.37	0.9661	0.138

Table 2: Ablation study. The best result of each metric is marked in bold font.

Method	PSNR [dB] (↑)	SSIM (↑)	WS-PSNR [dB] (↑)	WS-SSIM (↑)	VFID (↓)
w/o GFCIP	33.25	0.9884	33.25	0.9656	0.136
w/o ODAFP	32.84	0.9878	32.89	0.9650	0.148
Ours	33.37	0.9888	33.37	0.9661	0.138

Equations

X^{\prime}_{t}=\mathcal{W}(X_{t+1},F_{t\to t+1})*M_{r}+X_{t}*\bigl(1-M_{r}\bigr)

M_{r}(p)=\begin{cases}1&\text{if }p\in C_{1}\cap C_{2}\cap C_{3}\\0&\text{otherwise}\end{cases}

p^{\prime}=p+F_{t\to t+1}(p)+F_{t+1\to t}\bigl(p+F_{t\to t+1}(p)\bigr)

E(a,b)=\arccos\bigl(\cos\theta_{a}\,\cos\theta_{b}\,\cos(\phi_{a}-\phi_{b})+\sin\theta_{a}\,\sin\theta_{b}\bigr)

\phi_{p}

\theta_{p}

C_{1}:\quad E\bigl(p,\,p^{\prime}\bigr)<\epsilon

C_{2}:\quad M_{t}(p)=1

C_{3}:\quad M_{t+1}\bigl(p+F_{t\to t+1}(p)\bigr)=0

w_{\mathrm{erp}}(i,j)=\cos\!\biggl(\frac{(j+0.5-N/2)\,\pi}{N}\biggr)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies · Advanced Vision and Imaging

Full text

Ryosuke Seshimo, Mariko Isogawa

Keio University, Japan

Abstract

Omnidirectional videos that capture the entire surroundings are employed in a variety of fields such as VR applications and remote sensing. However, their wide field of view often causes unwanted objects to appear in the videos. This problem can be addressed by video inpainting, which enables the natural removal of such objects while preserving both spatial and temporal consistency. Nevertheless, most existing methods assume processing ordinary videos with a narrow field of view and do not tackle the distortion in equirectangular projection of omnidirectional videos. To address this issue, this paper proposes a novel deep learning model for omnidirectional video inpainting, called Distortion-Aware Omnidirectional Video Inpainting (DAOVI). DAOVI introduces a module that evaluates temporal motion information in the image space considering geodesic distance, as well as a depth-aware feature propagation module in the feature space that is designed to address the geometric distortion inherent to omnidirectional videos. The experimental results demonstrate that our proposed method outperforms existing methods both quantitatively and qualitatively.

1 Introduction

Omnidirectional videos that capture the entire scene from every direction can provide users with an immersive experience and attract attention for use in VR/AR applications and remote sensing [Cao et al.(2023), Wieland et al.(2012)Wieland, Pittore, Parolai, and Zschau]. Unlike ordinary videos with a narrow field of view (FoV), omnidirectional videos can represent the entire surroundings. As a result, camera operators need not worry much about adjusting the camera angle to keep the subject in frame. On the other hand, the wide FoV often causes unwanted objects to appear in the videos.

Video inpainting is a well-known technique for removing undesired regions inadvertently captured in videos. This technique designates the regions to be removed as masked regions and reconstructs them seamlessly to match the surrounding content. In particular, deep learning-based approaches [Hu et al.(2020)Hu, Wang, Ballas, Grauman, and Schwing, Zhou et al.(2023)Zhou, Li, Chan, and Loy] have emerged in recent years, demonstrating their capability to achieve high-quality completions.

Many of the recently effective approaches incorporate optical flow based propagation [Li et al.(2022)Li, Lu, Qin, Guo, and Cheng, Zhou et al.(2023)Zhou, Li, Chan, and Loy]. In these methods, the masked regions in the target frame are filled by propagating information from reference frames. This enables natural completion while maintaining spatial and temporal consistency. Moreover, considering the influence of optical flow estimation errors in masked regions on the inpainting results, a depth-guided video inpainting method [Li et al.(2023)Li, Zhu, Ge, Zeng, Imran, Abbasi, and Cooper] has also been proposed. This method uses depth estimation that is less prone to errors in masked regions to guide the inpainting process. However, despite these advances, most existing video inpainting methods [Xu et al.(2019)Xu, Li, Zhou, and Loy, Zhang et al.(2022a)Zhang, Fu, and Liu, Li et al.(2022)Li, Lu, Qin, Guo, and Cheng, Zhou et al.(2023)Zhou, Li, Chan, and Loy, Zeng et al.(2020)Zeng, Fu, and Chao, Li et al.(2023)Li, Zhu, Ge, Zeng, Imran, Abbasi, and Cooper, Liu et al.(2021)Liu, Deng, Huang, Shi, Lu, Sun, Wang, Dai, and Li] are designed for ordinary videos with a narrow FoV and are not suitable for omnidirectional videos, which include significant distortion due to equirectangular projection (ERP). Fig. 1 shows that applying these video inpainting methods to omnidirectional videos introduces artifacts, making it difficult to obtain plausible inpainting results.

Although methods have been proposed for omnidirectional video inpainting that account for distortion, many of these approaches [Kawai et al.(2010)Kawai, Machikita, Sato, and Yokoya, Kawai et al.(2014)Kawai, Inoue, Sato, Okura, Nakashima, and Yokoya, Xu et al.(2016)Xu, Pathak, Fujii, Yamashita, and Asama] impose strict requirements for input videos, such as requiring a static background or planar masked regions. As a result, it is challenging to apply these methods in a variety of scenarios.

To address this issue, this paper proposes a deep learning-based omnidirectional video inpainting method, called Distortion-Aware Omnidirectional Video Inpainting (DAOVI). DAOVI consists of two main modules tailored for omnidirectional videos. First, we introduce the Geodesic Flow-Consistent Image Propagation (GFCIP) module. Conventional flow-based methods [Xu et al.(2019)Xu, Li, Zhou, and Loy, Gao et al.(2020)Gao, Saraf, Huang, and Kopf, Zhou et al.(2023)Zhou, Li, Chan, and Loy] compute the sum of the forward and backward optical flows at each pixel and evaluate flow validity by checking whether the sum is below a threshold. Only valid flows are used to propagate pixel values in the image space. However, distortion in omnidirectional videos varies with position, so using a single threshold is no longer reasonable. Instead, GFCIP evaluates flow validity by checking the geodesic distance on the unit sphere.

Second, we propose the Omnidirectional Depth-Assisted Feature Propagation (ODAFP) module. Conventional approaches [Li et al.(2022)Li, Lu, Qin, Guo, and Cheng, Zhou et al.(2023)Zhou, Li, Chan, and Loy] propagate features in the latent space from adjacent frames using deformable convolutional networks (DCN) [Dai et al.(2017)Dai, Qi, Xiong, Li, Zhang, Hu, and Wei, Zhu et al.(2019)Zhu, Hu, Lin, and Dai], but they do not account for ERP distortion. ODAFP generates DCN offsets and modulation masks tailored to omnidirectional videos by using convolutions and padding schemes designed for $360^{\circ}$ images and a distortion map that indicates the amount of ERP distortion at each pixel. Furthermore, inspired by the finding that depth guidance can significantly enhance video inpainting [Li et al.(2023)Li, Zhu, Ge, Zeng, Imran, Abbasi, and Cooper], depth maps are used as additional input to complement flow-based propagation.

In summary, we make the following contributions:

• We propose a deep learning-based video inpainting method for omnidirectional videos, called Distortion-Aware Omnidirectional Video Inpainting (DAOVI).

• In the image space, we introduce the Geodesic Flow-Consistent Image Propagation (GFCIP) module, which evaluates optical flow validity using geodesic distance.

• In the feature space, we introduce the Omnidirectional Depth-Assisted Feature Propagation (ODAFP) module that performs propagation using distortion-guided modulation and convolutions designed for omnidirectional images, and uses depth maps as additional input to complement optical flow.

2 Related Work

Video Inpainting. Video inpainting is a technique for removing specified regions from a video and seamlessly restoring them. Various methods have been proposed in recent years [Zeng et al.(2020)Zeng, Fu, and Chao, Liu et al.(2021)Liu, Deng, Huang, Shi, Lu, Sun, Wang, Dai, and Li, Lee et al.(2019)Lee, Oh, Won, and Kim, Chang et al.(2019)Chang, Liu, Lee, and Hsu, Hu et al.(2020)Hu, Wang, Ballas, Grauman, and Schwing, Wang et al.(2019)Wang, Huang, Han, and Wang]. Among them, many recent approaches utilize optical flow [Xu et al.(2019)Xu, Li, Zhou, and Loy, Li et al.(2022)Li, Lu, Qin, Guo, and Cheng, Zhou et al.(2023)Zhou, Li, Chan, and Loy, Gao et al.(2020)Gao, Saraf, Huang, and Kopf, Zhang et al.(2022b)Zhang, Fu, and Liu, Zhang et al.(2022a)Zhang, Fu, and Liu]. For example, Xu et al. introduced an optical-flow-based video inpainting method called DFVI [Xu et al.(2019)Xu, Li, Zhou, and Loy]. In this approach, rather than directly predicting RGB pixel values in the masked region, the method first estimates the optical flow between adjacent frames. Based on the estimated flow, pixel values from surrounding frames are then propagated to fill the masked regions. By leveraging motion information, DFVI can accommodate complex object movements. However, it does not correct the propagation errors in pixel values in later processing modules.

To address this issue, Li et al. proposed E2FGVI [Li et al.(2022)Li, Lu, Qin, Guo, and Cheng], which is a video inpainting method inspired by optical-flow-based video super-resolution methods [Chan et al.(2021a)Chan, Wang, Yu, Dong, and Loy, Chan et al.(2022)Chan, Zhou, Xu, and Loy]. Their method encodes input frames into feature maps instead of processing them in the image space, and uses downsampled optical flow to guide feature propagation between adjacent frames. To enable accurate propagation of feature maps, E2FGVI employs DCN [Dai et al.(2017)Dai, Qi, Xiong, Li, Zhang, Hu, and Wei, Zhu et al.(2019)Zhu, Hu, Lin, and Dai] for feature refinement. Following BasicVSR++ [Chan et al.(2022)Chan, Zhou, Xu, and Loy], this approach is motivated by the observation that offset diversity improves performance [Chan et al.(2021b)Chan, Wang, Yu, Dong, and Loy]. However, even in regions where the optical flow estimation is accurate, the use of downsampled flow limits the achievable propagation accuracy.

To overcome this limitation, Zhou et al. proposed ProPainter [Zhou et al.(2023)Zhou, Li, Chan, and Loy], which adaptively combines image space propagation and feature space propagation. Their method first evaluates the validity of the optical flow using bidirectional consistency. Only regions with high flow validity are processed in the image space, allowing precise pixel-level propagation. This process partially fills the masked regions. Subsequently, feature space propagation is applied to complete the remaining masked regions. By integrating both propagation strategies, ProPainter effectively leverages accurate pixel information from regions with high flow validity via precise image propagation, while mitigating the adverse effects of flow errors through feature propagation.

As an alternative to optical-flow-based approaches, Li et al. proposed a video inpainting method called DGDVI [Li et al.(2023)Li, Zhu, Ge, Zeng, Imran, Abbasi, and Cooper], which utilizes depth maps as guidance. Because optical flow may be unreliable in masked regions due to high temporal variability, they instead used depth maps as more reliable guidance, since depth maps tend to be more temporally stable.

As described above, various video inpainting methods have been proposed, but they are designed for conventional videos with a narrow FoV and do not address the distortion in omnidirectional videos. Consequently, these methods cannot be considered optimal for omnidirectional video inpainting.

Omnidirectional Inpainting. Several inpainting methods tailored for omnidirectional content have been proposed. For example, Gkitsas et al. [Gkitsas et al.(2021)Gkitsas, Sterzentsenko, Zioulis, Albanis, and Zarpalas] and Pintore et al. [Pintore et al.(2022)Pintore, Agus, Almansa, and Gobbetti] proposed omnidirectional image inpainting methods that incorporate padding strategies considering the horizontal cyclic continuity, as well as loss functions that take into account the distortion of omnidirectional images. In addition, Gao et al. [Gao et al.(2023)Gao, Chen, Su, and Chu] proposed a method that preserves structural information of indoor scenes in omnidirectional images by leveraging a layout estimation model specifically designed for omnidirectional images.

Inpainting methods have been proposed not only for omnidirectional images but also for omnidirectional videos. For example, Kawai et al. [Kawai et al.(2010)Kawai, Machikita, Sato, and Yokoya] proposed an omnidirectional video inpainting method to fill regions that the omnidirectional camera systems of that era could not capture. Their method leverages structure-from-motion specifically adapted for omnidirectional videos to estimate 3D point clouds. These point clouds are then used to identify regions in other frames that correspond to the missing region in the current frame, enabling the inpainting process. Kawai et al. [Kawai et al.(2014)Kawai, Inoue, Sato, Okura, Nakashima, and Yokoya] proposed a method that automatically removes dynamic objects from omnidirectional videos by leveraging structure-from-motion. Xu et al. [Xu et al.(2016)Xu, Pathak, Fujii, Yamashita, and Asama] observed that distortion in ERP increases toward the poles. To address this, they rotate each frame so that the masked regions are positioned near the equator, where distortion is minimal, before performing optical-flow-based inpainting. Choi et al. [Choi et al.(2024)Choi, Jang, and Kim] proposed a method based on Neural Radiance Fields (NeRF) [Mildenhall et al.(2021)Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, and Ng] for removing dynamic objects from omnidirectional videos.

However, all of these omnidirectional video inpainting methods assume that masked regions are visible in other frames. In addition, the method by Kawai et al. [Kawai et al.(2010)Kawai, Machikita, Sato, and Yokoya] is limited to inpainting the ground region, while the method by Xu et al. [Xu et al.(2016)Xu, Pathak, Fujii, Yamashita, and Asama] requires a static background. These additional constraints limit their applicability. Moreover, Kawai et al. [Kawai et al.(2014)Kawai, Inoue, Sato, Okura, Nakashima, and Yokoya] and Choi et al. [Choi et al.(2024)Choi, Jang, and Kim] focus on removing dynamic objects to recover a static scene and do not address the removal of static elements such as graffiti on walls. In contrast, our method imposes no such constraints and enables inpainting of any user-specified region while preserving spatial and temporal consistency.

3 Method

3.1 Overview

This paper proposes Distortion-Aware Omnidirectional Video Inpainting (DAOVI), a deep learning model for video inpainting that accounts for distortion in omnidirectional videos. The overall framework is shown in Fig. 2. DAOVI takes as input the video frame sequence $\bm{X}=[X_{1},X_{2},\dots,X_{T}]$ and the corresponding mask sequence $\bm{M}=[M_{1},M_{2},\dots,M_{T}]$ , where $T$ denotes the total number of frames.

To ensure spatial and temporal consistency, DAOVI uses optical flow, following high-performing deep learning-based video inpainting methods [Xu et al.(2019)Xu, Li, Zhou, and Loy, Gao et al.(2020)Gao, Saraf, Huang, and Kopf, Li et al.(2022)Li, Lu, Qin, Guo, and Cheng, Zhou et al.(2023)Zhou, Li, Chan, and Loy]. Optical flow $\bm{F}=\{F^{f},F^{b}\}$ is estimated from the video frame sequence $\bm{X}$ and the mask sequence $\bm{M}$ using a pre-trained flow estimation model [Teed and Deng(2020)] and a flow completion network [Zhou et al.(2023)Zhou, Li, Chan, and Loy], where $F^{f}=\{F^{f}_{t}=F_{t\to t+1}\}^{T-1}_{t=1}$ and $F^{b}=\{F^{b}_{t}=F_{t+1\to t}\}^{T-1}_{t=1}$ represent the forward and backward flows, respectively.

Next, the Geodesic Flow-Consistent Image Propagation (GFCIP) module uses only reliable vectors in $\bm{F}$ to propagate pixel values from adjacent frames in the image domain. This module produces a partially inpainted output $\bm{X}^{\prime}=[X^{\prime}_{1},X^{\prime}_{2},\dots,X^{\prime}_{T}]$. Note that the validity of the flow vectors is assessed by computing the geodesic distance to ensure suitability for omnidirectional videos. Since regions with low flow validity remain unfilled, these areas are inpainted in the feature space: $\bm{X}^{\prime}$ is first passed through a $9$-layer CNN encoder to produce the feature maps $\bm{e}=[e_{1},e_{2},\dots,e_{T}]$.

Following GFCIP, the Omnidirectional Depth-Assisted Feature Propagation (ODAFP) module propagates information from adjacent frames in the feature space. Specifically, ODAFP employs DCN [Dai et al.(2017)Dai, Qi, Xiong, Li, Zhang, Hu, and Wei, Zhu et al.(2019)Zhu, Hu, Lin, and Dai] to propagate features from adjacent feature maps, guided by optical flow. To avoid relying exclusively on optical flow, ODAFP also incorporates depth information. The depth map $\bm{D}=[D_{1},D_{2},\dots,D_{T}]$ is estimated from the video frame sequence $\bm{X}$ and the mask sequence $\bm{M}$ via a pre-trained omnidirectional depth estimation model [Wang and Liu(2024)] and a depth completion network [Li et al.(2023)Li, Zhu, Ge, Zeng, Imran, Abbasi, and Cooper]. The parameters of both networks remain fixed during DAOVI training. To integrate depth information into feature space, a 2-layer CNN encoder generates depth feature maps $\bm{d}=[d_{1},d_{2},\dots,d_{T}]$ . Furthermore, to account for the geometric distortion inherent in omnidirectional videos during propagation, DCN offsets and modulation masks are weighted by a distortion map [Sun et al.(2017)Sun, Lu, and Yu], which represents per-pixel ERP distortion.

We incorporate a transformer-based module to further enhance the model’s expressive capacity. Specifically, we adopt the computationally efficient Mask-Guided Sparse Video Transformer [Zhou et al.(2023)Zhou, Li, Chan, and Loy]. Finally, a $4$-layer CNN decoder reconstructs the fully inpainted video frames $\hat{\bm{X}}=[\hat{X}_{1},\hat{X}_{2},\dots,\hat{X}_{T}]$. The following sections describe the two core modules (GFCIP and ODAFP) in detail.

3.2 Geodesic Flow-Consistent Image Propagation (GFCIP)

The output $X^{\prime}_{t}$ of the GFCIP module is computed from the input frame $X_{t}$ , its next frame $X_{t+1}$ , and the optical flow $F_{t\to t+1}$ representing the motion between frame $X_{t}$ and $X_{t+1}$ . This output can be formulated as

$$X^{\prime}_{t}=\mathcal{W}(X_{t+1},F_{t\to t+1})*M_{r}+X_{t}*\bigl(1-M_{r}\bigr),\tag{1}$$

where $\mathcal{W}(\cdot)$ is the warping operation. The binary mask $M_{r}$ identifies those pixels within the masked region that satisfy the required flow validity for reliable propagation. For a pixel position $p$ , the binary mask $M_{r}(p)$ is defined as

$$M_{r}(p)=\begin{cases}1&\text{if }p\in C_{1}\cap C_{2}\cap C_{3}\\0&\text{otherwise}\end{cases}\tag{2}$$

The conditions $C_{1}$ , $C_{2}$ , and $C_{3}$ are defined as follows. Condition $C_{1}$ tests whether the optical flow at a given pixel is sufficiently reliable. Since the flow is estimated by a pre-trained model and may contain errors, this check is required. The position $p^{\prime}$ is obtained by first mapping a pixel at $p$ from frame $X_{t}$ to $X_{t+1}$ via the optical flow $F_{t\to t+1}$ , and then mapping it back from frame $X_{t+1}$ to $X_{t}$ via $F_{t+1\to t}$ . It can be expressed as

$$p^{\prime}=p+F_{t\to t+1}(p)+F_{t+1\to t}\bigl(p+F_{t\to t+1}(p)\bigr).\tag{3}$$

If the optical flow were perfectly accurate, those two mappings would cancel out, and $p^{\prime}$ would coincide with $p$ . Accordingly, the distance between $p$ and $p^{\prime}$ quantifies the magnitude of the flow consistency error. As shown in Fig. 3, Euclidean distance in ERP pixel coordinates does not coincide with the actual distance. Therefore, this distance is evaluated using geodesic distance. We denote this geodesic distance by $E(\cdot)$ and compute it as

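$$E(a,b)=\arccos\bigl(\cos\theta_{a}\,\cos\theta_{b}\,\cos(\phi_{a}-\phi_{b})+\sin\theta_{a}\,\sin\theta_{b}\bigr),\tag{4}$$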

where $\phi$ and $\theta$ are the spherical coordinates (longitude and latitude) of a pixel, and $W$ and $H$ denote the width and height of the input video frames, respectively. Using this notation, $C_{1}$ is formalized as

$$C_{1}:\quad E\bigl(p,\,p^{\prime}\bigr)<\epsilon,\tag{7}$$

where $\epsilon$ is the threshold, which we set to $0.4^{\circ}$.
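Condition $C_{1}$ can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `flow_consistency_mask`, the nearest-pixel sampling of the backward flow, and the pixel-to-sphere convention (pixel centres at $+0.5$, longitude in $[-\pi,\pi)$, latitude in $[-\pi/2,\pi/2]$) are all assumptions, since the paper's exact definitions of $\phi_{p}$ and $\theta_{p}$ are not reproduced here.

```python
import numpy as np

def flow_consistency_mask(flow_fwd, flow_bwd, eps_deg=0.4):
    """Boolean (H, W) map of pixels whose round trip p -> p' satisfies E(p, p') < eps.

    flow_fwd, flow_bwd: arrays of shape (H, W, 2) holding (dx, dy) in pixels.
    """
    h, w, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)

    # Forward step: where pixel p lands in frame t+1.
    qx = xs + flow_fwd[..., 0]
    qy = ys + flow_fwd[..., 1]

    # Sample the backward flow at the nearest pixel (assumption of this sketch);
    # columns wrap because ERP is horizontally cyclic, rows are clamped.
    ix = np.rint(qx).astype(int) % w
    iy = np.clip(np.rint(qy).astype(int), 0, h - 1)
    px = qx + flow_bwd[iy, ix, 0]   # p' = p + F_fwd(p) + F_bwd(p + F_fwd(p))
    py = qy + flow_bwd[iy, ix, 1]

    def to_sphere(x, y):
        # Assumed ERP convention, not taken from the paper.
        phi = (x + 0.5) / w * 2.0 * np.pi - np.pi        # longitude
        theta = np.pi / 2.0 - (y + 0.5) / h * np.pi      # latitude
        return phi, theta

    phi_a, theta_a = to_sphere(xs, ys)
    phi_b, theta_b = to_sphere(px, py)

    # Great-circle distance on the unit sphere.
    cos_e = (np.cos(theta_a) * np.cos(theta_b) * np.cos(phi_a - phi_b)
             + np.sin(theta_a) * np.sin(theta_b))
    e = np.arccos(np.clip(cos_e, -1.0, 1.0))  # clip guards against rounding
    return e < np.deg2rad(eps_deg)
```

Under these assumptions, a one-pixel horizontal round-trip error at $304\times 152$ resolution corresponds to roughly $1.18^{\circ}$ on the equator but only about $0.012^{\circ}$ in the top row, so the same pixel error is rejected near the equator and accepted near the pole. A fixed pixel-space threshold cannot make this distinction.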

Condition $C_{2}$ signifies that the pixel position $p$ in frame $X_{t}$ is located in the masked region, which can be expressed as

$$C_{2}:\quad M_{t}(p)=1.\tag{8}$$

This condition ensures that the inpainting target region $M_{r}$ is a subset of the mask.

Condition $C_{3}$ requires that the position reached by moving $p$ according to the optical flow $F_{t\to t+1}$ lies outside the mask in frame $t+1$, which can be expressed as

$$C_{3}:\quad M_{t+1}\bigl(p+F_{t\to t+1}(p)\bigr)=0.\tag{9}$$

This condition ensures that the source location for propagation is unmasked.
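Once the three conditions are available as per-pixel maps, the reliable-pixel mask $M_{r}$ and the blended output $X^{\prime}_{t}$ reduce to a few array operations. The sketch below assumes that the warped frame $\mathcal{W}(X_{t+1},F_{t\to t+1})$ and the mask value at the flow target, $M_{t+1}(p+F_{t\to t+1}(p))$, have already been computed; the function name `propagate_frame` and the way the still-unfilled region is derived are illustrative assumptions.

```python
import numpy as np

def propagate_frame(x_t, warped_next, c1, mask_t, mask_next_at_target):
    """Blend warped pixels into frame t wherever C1, C2 and C3 all hold.

    x_t, warped_next:     (H, W, 3) float arrays (current frame, warped next frame).
    c1:                   (H, W) bool, flow passes the consistency check.
    mask_t:               (H, W), 1 inside the masked region of frame t (C2).
    mask_next_at_target:  (H, W), value of M_{t+1} at p + F(p); 0 means the
                          propagation source is unmasked (C3).
    """
    m_r = c1 & (mask_t == 1) & (mask_next_at_target == 0)
    m = m_r[..., None].astype(x_t.dtype)           # broadcast over channels
    x_prime = warped_next * m + x_t * (1.0 - m)    # X'_t = W(...) * M_r + X_t * (1 - M_r)
    # Masked pixels that were not filled here (assumed to be what is left
    # for the feature-space stage).
    unfilled = (mask_t == 1) & ~m_r
    return x_prime, m_r, unfilled
```

Only pixels inside the mask are ever overwritten, because $C_{2}$ is part of the conjunction; everything outside the mask is copied from $X_{t}$ unchanged.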

3.3 Omnidirectional Depth-Assisted Feature Propagation (ODAFP)

Fig. 4 shows the ODAFP module. ODAFP propagates features from adjacent frames in the feature space using DCN [Dai et al.(2017)Dai, Qi, Xiong, Li, Zhang, Hu, and Wei, Zhu et al.(2019)Zhu, Hu, Lin, and Dai] and optical flow, following high-performing existing video inpainting approaches [Li et al.(2022)Li, Lu, Qin, Guo, and Cheng, Zhou et al.(2023)Zhou, Li, Chan, and Loy]. Specifically, ODAFP first aligns the adjacent frame output $\hat{e}_{t+1}$ to the current frame $t$ using a DCN guided by the optical flow $F_{t+1\to t}$ . The aligned feature map and current frame feature map $e_{t}$ are then concatenated and passed through convolutional layers to produce the output of this module $\hat{e}_{t}$ . Similarly, feature propagation using the forward optical flow $F_{t\to t+1}$ is performed. For simplicity, the following description focuses on propagation with the backward optical flow $F_{t+1\to t}$ .

DCN requires offsets and modulation masks. To generate these, the following components are concatenated and fed through convolutional layers: the current feature map $e_{t}$ , the mask $M_{t}$ , the unfilled region mask $\hat{M}_{t}$ from GFCIP, backward optical flow $F_{t+1\to t}$ , its validity map $V_{t+1\to t}$ , the warped adjacent frame output $\mathcal{W}\bigl{(}\hat{e}_{t+1}\bigr{)}$ , and the depth feature map $d_{t}$ . Including $d_{t}$ avoids relying solely on optical flow and strengthens robustness.

The concatenated features are then processed by a sequence of 4 convolutional layers. In one of these layers, ODAFP uses the Adaptively Combined Dilated Convolution (ACDConv) [Zhuang et al.(2021)Zhuang, Lu, Wang, Xiao, and Wang], which is specifically designed for omnidirectional images. ACDConv dynamically weights and combines the outputs of multiple dilated convolution layers [Yu and Koltun(2016)], which enables the operation to adapt to the spatially varying distortions in omnidirectional videos.

Additionally, instead of standard zero padding, circular padding [Wang et al.(2018b)Wang, Huang, Lin, Hu, Zeng, and Sun] is applied in the convolutional layers. This padding method preserves the cyclic continuity between the leftmost and rightmost columns, and also preserves continuity at the poles in ERP images.
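The horizontal part of circular padding can be illustrated with NumPy's `np.pad` in `"wrap"` mode. This sketch pads only the width axis; how continuity at the poles is handled is not detailed in the text above, so no vertical padding is attempted here. The function name `pad_horizontal_circular` and the `(C, H, W)` layout are assumptions.

```python
import numpy as np

def pad_horizontal_circular(feat, pad):
    """Pad a (C, H, W) feature map along the width axis by wrapping around.

    The `pad` leftmost columns are copied past the right border and the
    `pad` rightmost columns past the left border, so a convolution sees the
    ERP image as horizontally continuous.
    """
    return np.pad(feat, ((0, 0), (0, 0), (pad, pad)), mode="wrap")
```

For a single row `[0, 1, 2, 3]` and `pad=1`, the padded row is `[3, 0, 1, 2, 3, 0]`, whereas zero padding would give `[0, 0, 1, 2, 3, 0]` and introduce an artificial seam.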

Moreover, to better adapt the offsets and modulation masks to omnidirectional videos, we utilize the ERP distortion weighting formula [Sun et al.(2017)Sun, Lu, and Yu]. The per-pixel ERP distortion weight $w_{\mathrm{erp}}(i,j)$ at coordinate $(i,j)$ is defined as

$$w_{\mathrm{erp}}(i,j)=\cos\!\biggl(\frac{(j+0.5-N/2)\,\pi}{N}\biggr),\tag{10}$$

where $N$ denotes the height of the ERP image. As shown in Equation 10, the ERP distortion weight $w_{\mathrm{erp}}(i,j)$ is determined solely by the pixel’s y-coordinate $j$. A per-pixel distortion map is constructed by applying $w_{\mathrm{erp}}(i,j)$ at each coordinate. As shown in Fig. 5, the per-pixel distortion weight $w_{\mathrm{erp}}$ is larger near the equator and smaller toward the poles, taking values close to $1$ around the equator and tapering toward $0$ near the poles. This indicates that distortion due to the equirectangular projection is minimal near the equator and increases toward the poles.
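The distortion map follows directly from the formula $w_{\mathrm{erp}}(i,j)=\cos\bigl((j+0.5-N/2)\,\pi/N\bigr)$. A minimal NumPy sketch is given below; the function name `erp_distortion_map` is hypothetical.

```python
import numpy as np

def erp_distortion_map(height, width):
    """Per-pixel ERP weight map of shape (height, width).

    The weight depends only on the row index j, so one row profile is
    computed and repeated across all columns.
    """
    j = np.arange(height, dtype=np.float64)
    w_row = np.cos((j + 0.5 - height / 2.0) * np.pi / height)
    return np.repeat(w_row[:, None], width, axis=1)
```

For a $304\times 152$ frame, the two rows adjacent to the equator receive weights just below $1$, while the top and bottom rows receive weights of about $0.01$.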

To effectively utilize the distortion map, we employ the Distortion Guidance Generator (DGG) module [Yang et al.(2025)Yang, Dong, Xiao, Zhang, Lam, Zhou, and Qiu]. DGG encodes the per-pixel distortion map into the latent space and produces distortion guidance $G$ . This distortion guidance $G$ is used to weight the DCN offsets and modulation masks. The weighted offsets are then added to the optical flow $F_{t+1\to t}$ , and the weighted modulation masks are passed through a sigmoid function before being applied in the DCN.
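The final weighting step can be sketched as below. This is only an illustration under strong assumptions: "weighting" is taken to mean element-wise multiplication, the guidance $G$ is treated as a single `(H, W)` map rather than a learned latent feature, and the tensor layouts and the function name `apply_distortion_guidance` are hypothetical. The actual DGG module and DCN interface may differ.

```python
import numpy as np

def apply_distortion_guidance(raw_offsets, raw_masks, guidance, flow):
    """Weight DCN offsets and modulation masks with a distortion guidance map.

    raw_offsets: (K, 2, H, W) residual offsets for K sampling points.
    raw_masks:   (K, H, W) pre-activation modulation masks.
    guidance:    (H, W) distortion guidance (assumed scalar per pixel).
    flow:        (2, H, W) optical flow used as the base offset.
    """
    # Weighted residual offsets are added to the optical flow.
    offsets = raw_offsets * guidance + flow
    # Weighted masks go through a sigmoid so they lie in (0, 1).
    masks = 1.0 / (1.0 + np.exp(-(raw_masks * guidance)))
    return offsets, masks
```

With this formulation, a guidance value near zero makes the sampling position fall back to the optical flow and drives the modulation toward $0.5$, while larger guidance values let the learned residuals dominate.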

4 Experiments

In this section, we describe the experiments conducted to validate the effectiveness of the proposed method. Implementation details, an additional ablation study, and additional qualitative results are provided in the supplementary material.

4.1 Experimental Settings

Dataset. In this study, we used the ODV360 omnidirectional video dataset [Cao et al.(2023)]. ODV360 contains 210 videos for training, 20 for validation, and 20 for testing, with each video consisting of 100 frames. We adhered to this split for model training and evaluation. Although the original videos in ODV360 have a resolution of $540\times 270$ pixels, we downsampled them to $304\times 152$ pixels due to computational resource constraints.

Since ODV360 does not provide mask frames for the video inpainting task, we generated random mask sequences for evaluation, following the quantitative evaluation protocol of existing video inpainting methods [Li et al.(2022)Li, Lu, Qin, Guo, and Cheng, Zhou et al.(2023)Zhou, Li, Chan, and Loy]. Unlike the random masks used in prior video inpainting methods, which typically consist of a single static region, our evaluation uses sequences composed of multiple randomly shaped regions that move independently across frames. Because distortion in omnidirectional videos varies with position, as noted in Section 3.3, this setting disperses the mask positions spatially to ensure a fair comparison. The same random masks are used for qualitative evaluation.

Network and Training. We train our model using the Adam [Kingma and Ba(2014)] optimizer with a batch size of 8, setting the learning rate to $1.5\times 10^{-4}$ and running $80$k iterations. Our implementation is based on the PyTorch framework, and training is performed on a single NVIDIA RTX 6000 Ada GPU.

Evaluation Metrics. For the quantitative evaluation of inpainting results, we employed five metrics: PSNR, SSIM [Wang et al.(2004)Wang, Bovik, Sheikh, and Simoncelli], WS-PSNR [Sun et al.(2017)Sun, Lu, and Yu], WS-SSIM [Zhou et al.(2018)Zhou, Yu, Ma, Shao, and Jiang], and VFID [Wang et al.(2018a)Wang, Liu, Zhu, Liu, Tao, Kautz, and Catanzaro]. WS-PSNR and WS-SSIM are weighted versions of PSNR and SSIM [Wang et al.(2004)Wang, Bovik, Sheikh, and Simoncelli], respectively, which account for the positional distortion inherent in omnidirectional videos. VFID is an extension of the image quality metric FID [Heusel et al.(2017)Heusel, Ramsauer, Unterthiner, Nessler, and Hochreiter] to videos, and it evaluates perceptual video quality similarity using a pre-trained CNN model [Carreira and Zisserman(2017), Xie et al.(2016)Xie, Girshick, Dollár, Tu, and He].
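To make the weighting concrete, the following is a minimal single-frame sketch of WS-PSNR. It assumes the commonly used definition in which the squared error is averaged with the latitude weights $\cos\bigl((j+0.5-N/2)\,\pi/N\bigr)$ before the usual PSNR formula is applied. The function name `ws_psnr` and the 8-bit peak value are assumptions, and the evaluation code used for the reported numbers may differ.

```python
import numpy as np

def ws_psnr(ref, test, max_val=255.0):
    """Weighted-to-spherically-uniform PSNR for one ERP frame.

    ref, test: arrays of shape (H, W) or (H, W, C) with the same value range.
    """
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    h = ref.shape[0]

    # Latitude weights: close to 1 at the equator, close to 0 at the poles.
    j = np.arange(h)
    w_row = np.cos((j + 0.5 - h / 2.0) * np.pi / h)
    weights = np.broadcast_to(w_row[:, None], ref.shape[:2])

    err = (ref - test) ** 2
    if err.ndim == 3:
        err = err.mean(axis=2)          # average over colour channels

    wmse = (weights * err).sum() / weights.sum()
    if wmse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / wmse)
```

Under this definition, an error of the same magnitude lowers the score less when it lies in a row near a pole than when it lies on the equator, which is the behaviour that plain PSNR lacks for ERP frames.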

4.2 Comparison with Baseline Methods

Quantitative Results. We report quantitative results on the ODV360 dataset [Cao et al.(2023)]. In this experiment, we compared our method with several state-of-the-art (SOTA) video inpainting methods, including FuseFormer [Liu et al.(2021)Liu, Deng, Huang, Shi, Lu, Sun, Wang, Dai, and Li], STTN [Zeng et al.(2020)Zeng, Fu, and Chao], and ProPainter [Zhou et al.(2023)Zhou, Li, Chan, and Loy], to evaluate the performance of our method. As shown in Tab. 1, our DAOVI surpasses existing video inpainting methods on all evaluation metrics. The high PSNR and SSIM scores indicate high reconstruction fidelity, while the low VFID score demonstrates perceptually plausible inpainted frames. Moreover, the high WS-PSNR and WS-SSIM confirm that DAOVI exhibits excellent performance for omnidirectional videos.

Qualitative Results. For the visual comparison, we compare our method with other video inpainting approaches, including FuseFormer [Liu et al.(2021)Liu, Deng, Huang, Shi, Lu, Sun, Wang, Dai, and Li], STTN [Zeng et al.(2020)Zeng, Fu, and Chao], and ProPainter [Zhou et al.(2023)Zhou, Li, Chan, and Loy]. As shown in Fig. 6, applying methods designed for narrow-FoV videos to omnidirectional inputs produces noticeable artifacts and unsatisfactory reconstructions. In contrast, our DAOVI produces visually plausible results. This demonstrates the effectiveness of the proposed method for omnidirectional video inpainting.

4.3 Ablation Study

An ablation study is conducted to verify the effectiveness of the proposed modules. As shown in Tab. 2, removing either GFCIP or ODAFP leads to degraded scores on most metrics relative to the full model. These results indicate that both modules enhance omnidirectional video inpainting performance. The score drop is larger when ODAFP is removed, which highlights its stronger contribution. By contrast, GFCIP contributes less to performance improvement than ODAFP, but can be incorporated without introducing any additional learnable parameters.
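The size of each gap can be read off Tab. 2 with a few lines of arithmetic. The snippet below simply subtracts each ablated variant from the full model using the values reported in the table; the dictionary `table2` and the helper `deltas` are illustrative names.

```python
# Values copied from Tab. 2 (ablation study).
table2 = {
    "w/o GFCIP": {"PSNR": 33.25, "SSIM": 0.9884, "WS-PSNR": 33.25, "WS-SSIM": 0.9656, "VFID": 0.136},
    "w/o ODAFP": {"PSNR": 32.84, "SSIM": 0.9878, "WS-PSNR": 32.89, "WS-SSIM": 0.9650, "VFID": 0.148},
    "Ours":      {"PSNR": 33.37, "SSIM": 0.9888, "WS-PSNR": 33.37, "WS-SSIM": 0.9661, "VFID": 0.138},
}

def deltas(variant):
    """Full model minus ablated variant, per metric (rounded to 4 decimals)."""
    full = table2["Ours"]
    return {m: round(full[m] - table2[variant][m], 4) for m in full}

for name in ("w/o GFCIP", "w/o ODAFP"):
    print(name, deltas(name))
```

Removing ODAFP costs 0.53 dB in PSNR and 0.48 dB in WS-PSNR, against 0.12 dB on both for GFCIP. For VFID, where lower is better, the full model is 0.010 better than the variant without ODAFP and 0.002 worse than the variant without GFCIP.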

5 Conclusion

In this work, we propose a deep learning–based omnidirectional video inpainting framework called DAOVI. The proposed framework includes modules that perform distortion-aware propagation in both the image space and the feature space to address the unique geometry of omnidirectional videos. In the image space, pixel values are propagated only along reliable optical flow. The reliability of the optical flow is assessed by the consistency error between forward and backward flows, measured in geodesic distance rather than Euclidean distance in ERP pixel coordinates. In the feature space, distortion-aware convolution is employed to adapt to ERP-specific distortion. In addition, intermediate features of the model are weighted with a distortion map that encodes the per-pixel magnitude of ERP distortion. Furthermore, to avoid excessive reliance on the accuracy of the estimated optical flow, the estimated depth map is also used as an input to the model. Experimental results demonstrate that DAOVI outperforms SOTA video inpainting methods designed for ordinary videos with a narrow field of view in both quantitative and qualitative evaluations.

Acknowledgment. This work was partially supported by JSPS KAKENHI(A) 25H01159.

Bibliography

1. [Cao et al.(2023)] Mingdeng Cao et al. NTIRE 2023 challenge on 360° omnidirectional image and video super-resolution: Datasets, methods and results. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1731–1745, 2023.
2. [Carreira and Zisserman(2017)] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
3. [Chan et al.(2021a)Chan, Wang, Yu, Dong, and Loy] Kelvin C.K. Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. BasicVSR: The search for essential components in video super-resolution and beyond. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4945–4954, 2021a.
4. [Chan et al.(2021b)Chan, Wang, Yu, Dong, and Loy] Kelvin C.K. Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Understanding deformable alignment in video super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021b.
5. [Chan et al.(2022)Chan, Zhou, Xu, and Loy] Kelvin C.K. Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5962–5971, 2022.
6. [Chang et al.(2019)Chang, Liu, Lee, and Hsu] Ya-Liang Chang, Zhe Yu Liu, Kuan-Ying Lee, and Winston Hsu. Free-form video inpainting with 3D gated convolution and temporal PatchGAN. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 9065–9074, 2019.
7. [Choi et al.(2024)Choi, Jang, and Kim] Dongyoung Choi, Hyeonjoong Jang, and Min H. Kim. OmniLocalRF: Omnidirectional local radiance fields from dynamic videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6871–6880, 2024.
8. [Dai et al.(2017)Dai, Qi, Xiong, Li, Zhang, Hu, and Wei] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 764–773, 2017.