Align-and-Attend Network for Globally and Locally Coherent Video Inpainting
Sanghyun Woo, Dahun Kim, KwanYong Park, Joon-Young Lee, In So Kweon

TL;DR
This paper introduces a novel feed-forward video inpainting network that combines alignment and non-local attention modules to achieve globally and locally coherent results, effectively handling large or slowly moving holes.
Contribution
The proposed network uniquely integrates alignment and non-local attention with recurrent propagation for improved temporal and spatial coherence in video inpainting.
Findings
Effective inpainting of large or slowly moving holes
Outperforms existing flow-based methods in coherence
Maintains temporal consistency in results
Abstract
We propose a novel feed-forward network for video inpainting. We use a set of sampled video frames as the reference to take visible contents to fill the hole of a target frame. Our video inpainting network consists of two stages. The first stage is an alignment module that uses computed homographies between the reference frames and the target frame. The visible patches are then aggregated based on the frame similarity to fill in the target holes roughly. The second stage is a non-local attention module that matches the generated patches with known reference patches (in space and time) to refine the previous global alignment stage. Both stages use a large spatial-temporal window for the reference and thus enable modeling long-range correlations between distant information and the hole regions. Therefore, even challenging scenes with large or slowly moving holes can be handled,…
| Multi-frame aggregation | Output Propagation | ||
| Align | Refine | Flow estimator | FID score |
| 8.966 (0.709) | |||
| ✓ | 8.139 (1.017) | ||
| ✓ | 8.577 (0.838) | ||
| ✓ | ✓ | 8.262 (1.615) | |
| ✓ | 7.515 (0.608) | ||
| ✓ | ✓ | 7.196 (1.863) | |
| ✓ | ✓ | 7.149 (1.753) | |
| ✓ | ✓ | ✓ | 5.775 (1.707) |
| Matching method | FID score |
|---|---|
| entire-entire | 7.156 (1.818) |
| hole-nonhole | 5.775 (1.707) |
| Flow estimator | Warping error |
|---|---|
| | 0.0027 (0.0001) |
| ✓ | 0.0019 (0.0001) |
Align-and-Attend Network for Globally and Locally Coherent Video Inpainting
Sanghyun Woo
EE, KAIST
Daejeon, Korea
Dahun Kim
EE, KAIST
Daejeon, Korea
KwanYong Park
EE, KAIST
Daejeon, Korea
Joon-Young Lee
Adobe Research
San Jose, CA, USA
In So Kweon
EE, KAIST
Daejeon, Korea
Abstract
We propose a novel feed-forward network for video inpainting. We use a set of sampled video frames as the reference to take visible contents to fill the hole of a target frame. Our video inpainting network consists of two stages. The first stage is an alignment module that uses computed homographies between the reference frames and the target frame. The visible patches are then aggregated based on the frame similarity to fill in the target holes roughly. The second stage is a non-local attention module that matches the generated patches with known reference patches (in space and time) to refine the previous global alignment stage. Both stages use a large spatial-temporal window for the reference and thus enable modeling long-range correlations between distant information and the hole regions. Therefore, even challenging scenes with large or slowly moving holes can be handled, which existing flow-based approaches have hardly been able to model. Our network is also designed with a recurrent propagation stream to encourage temporal consistency in video results. Experiments on video object removal demonstrate that our method inpaints the holes with globally and locally coherent contents.
1 Introduction
Video inpainting aims to fill spatial-temporal holes with plausible content in a video. It is a practical and crucial problem as it could be beneficial for various video editing and restoration tasks. However, it is very challenging to maintain both spatial and temporal consistency; the inpainted contents must be spatially plausible, and temporally coherent at the same time.
Early works on video inpainting use patch-based optimization techniques [26, 7, 17, 8]. Among them, Huang et al. [8] proposed a global flow-field-based optimization to preserve temporal consistency, and showed state-of-the-art video quality. However, its effectiveness comes at the cost of limited practicality, due to intensive computation and vulnerability to noisy optical flow. Recently, two seminal works have proposed deep feed-forward methods for video inpainting [23, 12]. Wang et al. [23] proposed CombCN, which combines 3D and 2D CNNs, but their setting works on low-resolution videos with fixed square holes, limiting its application to general video object removal. Kim et al. [12] proposed VINet, which aggregates information by flow warping from nearby frames to the target frame. However, its internal dependency on optical flow restricts the size of the temporal search window, which sometimes leads to boundary artifacts and blurry textures inconsistent with the global video contents.
To overcome the aforementioned issues, we propose a novel coarse-to-fine network for video inpainting. We use a set of sampled video frames as the reference to take visible contents to fill the hole of a target frame. Our proposed network consists of two stages. The first stage is an alignment module that uses computed homographies between the reference frames and the target frame. Despite being able to model only global transformations (e.g., affine, perspective), homography-based alignment provides a much larger temporal search window than its optical-flow-based counterpart, e.g., [12], as illustrated in Fig. 1. The visible patches are then aggregated to roughly fill in the target holes. The second stage consists of a non-local attention module that matches the generated patches with known reference patches in space and time, and a softmax that temporally pools the most relevant patches. This refinement stage compensates for real motions that cannot be modeled by the preceding global transformations. Both stages use a large spatial-temporal window for the reference, and thus enable modeling long-range correlations between distant information and the hole regions. Therefore, even challenging scenes with large or slowly moving holes can be handled. Our network is also designed with a decoder to synthesize contents that are never visible throughout the video, and a recurrent propagation stream to encourage temporal consistency in video results.
We show that our video results are more semantically plausible, and temporally smooth compared to the previous methods. Our model sequentially processes video frames of arbitrary length and runs at a near real-time rate.
2 Related Work
Traditional inpainting methods. Early works on image inpainting broadly fall into either diffusion-based [1, 2, 15] or patch-based methods [3, 4, 5, 22]. The former propagate texture from the hole boundaries towards the hole center; they work well with small holes, but suffer from artifacts and noisy results with large holes. The latter try to match and copy the nearest-neighbor background patches, and are widely deployed in practical applications.
For videos, Granados et al. [6] and Newson et al. [17] proposed to align the frames in addition to using optical flow or 3D PatchMatch search. Huang et al. [8] jointly optimize global flow and colors throughout a video for long-term temporal consistency. As mentioned earlier, these methods are computationally heavy, prone to flow errors, and unable to capture high-level semantics.
Learning-based inpainting methods. Deep-learning-based methods have achieved great success on the image inpainting task [18, 9, 29, 16, 28]. These works use convolutional neural networks together with generative adversarial networks [18], global and local discriminators to improve spatial coherency [9], a coarse-to-fine model with contextual attention [29], and partial convolution [16] and gated convolution [28] to handle free-form masks. However, they do not consider any consistency between frames when applied to videos.
Recently, two deep-learning-based methods have been proposed for the video inpainting task [23, 12]. CombCN [23] has a 3D CNN followed by 2D CNNs, where the 3D CNN part captures the temporal structure from low-resolution video. VINet [12] deals with the real video object removal task by collecting visible information from nearby frames via flow warping. Nevertheless, their fundamental limitation is their small spatial-temporal window size, which limits their performance on scenes with large and slowly moving holes.
3 Proposed Algorithm
Let $X_1^T := \{X_1, X_2, \dots, X_T\}$ be a set of video frames with spatial-temporal holes, and $Y_1^T := \{Y_1, Y_2, \dots, Y_T\}$ be the reconstructed ground truth frames. We aim to learn a mapping $G: X_1^T \rightarrow \hat{Y}_1^T$ such that the prediction $\hat{Y}_1^T$ is as close as possible to the ground truth video $Y_1^T$, while being plausible and consistent in space and time. This can also be formulated as a conditional video generation task [24, 14] where we estimate the conditional $p(\hat{Y}_1^T \mid X_1^T)$. To simplify the problem, we rely on a Markov assumption [24, 12] to factorize the conditional into a product form, such that the generation of the $t$-th target frame $\hat{Y}_t$ depends on 1) the current input frame $X_t$, 2) the two previous output frames $\hat{Y}_{t-2}^{t-1}$, and 3) a set of sampled reference frames $X_{ref}$:
$$p(\hat{Y}_1^T \mid X_1^T) = \prod_{t=1}^{T} p(\hat{Y}_t \mid X_t, \hat{Y}_{t-2}^{t-1}, X_{ref})$$
The main idea of our approach is to use a set of sampled reference frames, $X_{ref}$, that spans a sufficiently large temporal search window, so that the visible information in the window can be fetched to inpaint the target frame with globally coherent contents. Based on our preliminary experiments, we sample every 10th frame in a video to construct $X_{ref}$, which provides a much larger temporal window than previous approaches [23, 12]. Another important design choice is to enforce each prediction to be temporally coherent with the past predictions, $\hat{Y}_{t-2}^{t-1}$. With this recurrent pathway, our model runs in an auto-regressive manner. Our proposed method outperforms existing learning-based methods [23, 12], and performs on par with the optimization-based method [8] while running much faster.
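As a small illustration of the sampling rule described above, the sketch below picks every 10th frame index as a reference. The function name `sample_reference_indices` and the choice to start at index 0 are assumptions made for illustration; the text only states the stride.

```python
def sample_reference_indices(num_frames, stride=10):
    """Return indices of reference frames taken every `stride` frames.

    Starting at frame 0 is an assumption; the paper only states the stride.
    """
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    return list(range(0, num_frames, stride))


# A 35-frame clip yields four reference frames spread over the whole clip.
reference_ids = sample_reference_indices(35)
```

With a fixed stride, the temporal span covered by the references grows with the video length, which is what gives the search window its long range.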
3.1 Network Design
The overview of our network is shown in Fig. 2. The whole architecture can be divided into three parallel pathways: homography estimator, align-and-attend video inpainter, and flow estimator.
3.1.1 Homography estimator
Given a set of reference frames and a target frame, the goal of the homography estimator is to produce transformation parameters $\theta$, which are used to warp and align each reference frame onto the target frame.
Homography Encoder takes an image of size $H \times W$ pixels as input and produces an embedded feature map $f \in \mathbb{R}^{C \times h \times w}$, where $C$ denotes the channel size and $h \times w$ the spatial size of the feature map. We use the same shared encoder for both the reference and target frames. We denote the features of any reference frame by $f_r$, and those of the target frame by $f_t$.
Masked matching produces a measure of similarity between the reference and target feature maps. We denote the matching function as $\mathcal{M}$, such that $c = \mathcal{M}(f_r, f_t)$, where $c$ is a cosine similarity map computed between channel-wise normalized $f_r$ and $f_t$. We constrain the matching to happen only between the visible parts to deal with the hole regions. To this end, we use downsampled binary inpainting masks $m_r$ and $m_t$. With $i$ and $j$ denoting the spatial grid indices for $f_r$ and $f_t$ respectively, the correlation map is computed as:
[TABLE]
The similarity is normalized by the softmax over the spatial dimension of , for each .
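The masked matching step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: masks are taken to be 1 on visible cells and 0 inside holes, masking is done by multiplying the cosine score, and the softmax is taken over reference positions for each target position. All names are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a feature vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def masked_correlation(ref_feats, tgt_feats, ref_visible, tgt_visible):
    """Masked cosine correlation followed by a softmax.

    ref_feats / tgt_feats: lists of C-dim vectors, one per flattened grid cell.
    ref_visible / tgt_visible: 1 where the cell is visible, 0 inside a hole
    (this mask convention is an assumption).
    Returns corr[j][i]: weight of reference cell i for target cell j.
    """
    ref_n = [l2_normalize(f) for f in ref_feats]
    tgt_n = [l2_normalize(f) for f in tgt_feats]
    corr = []
    for j, ft in enumerate(tgt_n):
        scores = []
        for i, fr in enumerate(ref_n):
            cos = sum(a * b for a, b in zip(fr, ft))
            # Zero out any pair that touches a hole on either side.
            scores.append(cos * ref_visible[i] * tgt_visible[j])
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        corr.append([e / total for e in exps])
    return corr
```

Because holes are suppressed by multiplication rather than excluded, masked pairs still receive a small softmax weight in this sketch; a stricter variant would drop them from the normalization.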
Transformation estimator takes the correlation map as input and produces homography parameters between the reference and target. It is trained to output 6 parameters (i.e., an affine transformation).
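To make the 6-parameter output concrete, the sketch below turns six numbers into a 3x3 affine matrix and applies it to a point. The row-major ordering `(a, b, tx, c, d, ty)` is an assumption chosen for illustration; the text does not specify how the parameters are laid out.

```python
def affine_matrix(theta):
    """Build a 3x3 homogeneous matrix from 6 affine parameters.

    The ordering (a, b, tx, c, d, ty) is an assumed row-major 2x3 layout.
    """
    a, b, tx, c, d, ty = theta
    return [[a, b, tx],
            [c, d, ty],
            [0.0, 0.0, 1.0]]


def warp_point(theta, x, y):
    """Map a coordinate (x, y) through the affine transformation."""
    a, b, tx, c, d, ty = theta
    return (a * x + b * y + tx, c * x + d * y + ty)


# The identity parameters leave a point unchanged; a translation shifts it.
identity = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)
shift = (1.0, 0.0, 5.0, 0.0, 1.0, -1.0)
```

Applying such a transform to every cell of a sampling grid is what aligns a reference feature map with the target.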
3.1.2 Align-and-Attend Video Inpainter
Our video inpainter is an encoder-decoder model consisting of the following components, which are designed to reconstruct the target holes in a coarse-to-fine manner.
Image Encoder part follows the same architecture as in the homography estimator. Similarly, we denote encoded features of any reference frame by , and those of the target frame by , both of spatial size pixels.
Alignment stage is given the homography parameters computed as in Sec. 3.1.1, and accordingly aligns the reference feature maps onto the target feature map. We denote a reference feature map that is aligned to the target by , where , and denote frame index and the number of reference frames, respectively.
After the alignment, the next step is to pick up the most relevant reference feature points in the spatial-temporal search window. We present an aggregation function that evaluates the alignment of each reference feature map and excludes irrelevant information such as newly introduced scene parts. We measure the Euclidean distance (i.e., L2-norm) between each aligned reference frame and the target frame, while ignoring the hole regions using the binary inpainting masks:
[TABLE]
where smaller value of represents better alignment of -th reference frame. The distance measure is used as a weighting coefficient, multiplied with corresponding inpainting masks . It is followed by softmax across temporal dimension to obtain a volume that weighs relevant pixels in the stack of and flattens the stack into one-frame feature map:
[TABLE]
We identify the visible region that can be borrowed from with reference frames by , and the initial coarse prediction of the target feature map is then obtained as:
[TABLE]
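The aggregation described above (masked distance per reference, temporal softmax, then filling the target hole) might be sketched as below. This is a simplified illustration under several assumptions: features have a single channel, masks are 1 on visible cells, the softmax uses the negative distance so that better-aligned references get larger weights, and references that are themselves missing at a position are excluded. None of these details are confirmed by the text, and all names are hypothetical.

```python
import math


def aggregate_aligned(aligned_refs, ref_visible, target, tgt_visible):
    """Fill target holes from aligned references, weighted by alignment quality.

    aligned_refs: N lists of per-position scalars (one channel for brevity).
    ref_visible:  N lists of 0/1 flags (1 = visible; assumed convention).
    target, tgt_visible: the target feature and its 0/1 visibility flags.
    """
    # Masked L2 distance per reference, over cells visible in both frames.
    dists = []
    for feat, vis in zip(aligned_refs, ref_visible):
        sq = sum(v * tv * (f - t) ** 2
                 for f, v, t, tv in zip(feat, vis, target, tgt_visible))
        dists.append(math.sqrt(sq))

    coarse = []
    for p in range(len(target)):
        if tgt_visible[p]:
            coarse.append(target[p])  # keep known content untouched
            continue
        cands = [k for k in range(len(aligned_refs)) if ref_visible[k][p]]
        if not cands:
            coarse.append(0.0)  # never visible: left for later stages
            continue
        top = max(-dists[k] for k in cands)
        weights = [math.exp(-dists[k] - top) for k in cands]  # temporal softmax
        total = sum(weights)
        coarse.append(sum(w * aligned_refs[k][p]
                          for w, k in zip(weights, cands)) / total)
    return coarse
```

Positions that no reference can see stay empty here, which matches the role the text assigns to the residual pathway and decoder.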
Refinement stage is designed to model pixel-wise correspondences [25], e.g., non-rigid motions, that cannot be covered by the previous global alignment stage. We propose to match the coarsely generated patches with the non-hole patches in the reference frame stack. The pixel-wise matching between the reference and target feature maps is described in Fig. 2. The proposed non-local attention picks up the most relevant and best matching patches in the spatial-temporal search window, and aggregates them into the target hole regions to make a refined prediction. This module is designed to be non-parametric, not requiring any embedding layers.
[TABLE]
where , , are a feature reshaped into a matrix with a shape of , , and respectively. Only and are L2-normalized over channel axis for attention map computation. is properly reshaped before summation with .
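A non-parametric attention step of the kind described in this stage can be sketched as follows. It assumes that queries are the coarsely generated target positions, that keys and values are non-hole reference positions pooled over all reference frames, that only queries and keys are L2-normalized, and that the attended feature is added back to the coarse feature. The absence of a softmax temperature and all function names are assumptions.

```python
import math


def _normalize(vec, eps=1e-8):
    """L2-normalize a feature vector over its channel axis."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def nonlocal_refine(coarse, generated, ref_feats, ref_visible):
    """Refine coarsely generated target features by attending to references.

    coarse:      list of C-dim vectors, one per target position.
    generated:   0/1 flags marking coarsely generated positions (the queries).
    ref_feats:   C-dim vectors from all reference frames, flattened in space-time.
    ref_visible: 0/1 flags marking non-hole reference positions (keys/values).
    """
    # No learned embeddings: keys are normalized features, values are raw ones.
    keys = [(_normalize(f), f) for f, v in zip(ref_feats, ref_visible) if v]
    refined = [list(f) for f in coarse]
    if not keys:
        return refined
    for p, (feat, flag) in enumerate(zip(coarse, generated)):
        if not flag:
            continue  # only newly generated patches are touched
        query = _normalize(feat)
        scores = [sum(a * b for a, b in zip(query, k)) for k, _ in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        for ch in range(len(feat)):
            attended = sum(w * val[ch] for w, (_, val) in zip(weights, keys)) / total
            refined[p][ch] = feat[ch] + attended  # residual summation
    return refined
```

Restricting queries to generated positions and keys to visible reference positions mirrors the hole-nonhole matching that the ablation compares against entire-entire matching.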
Residual pathway is another convolutional pathway in parallel with the align-and-refine pathway. It is designed to allow the network to learn single image inpainting, that is, to hallucinate novel contents that are never visible throughout the search window. The two pathways are aggregated and fed into a single decoder to obtain the final output.
Image Decoder takes the aggregation of the two pathways, together with the warped features of the previous output frames . In our preliminary experiment, we found that adding the intermediate representations from the previous time steps not only provides rich training signals to the whole network, but also enhances the temporal coherency in video results. The decoder recovers the fine details for the hole regions to generate raw output, . It is designed with nearest-neighbor upsampling layers and following convolutions to prevent checkerboard artifacts.
3.1.3 Optical flow estimator
Our optical flow estimator is a simple encoder-decoder model. It computes flow fields between the previous output frame and the current target frame, which are used to enforce temporal consistency.
Flow Encoder takes previous two output frames as input in order to propagate reusable information to the current time step. The encoded features are also fed into the decoder of the video inpainter.
Flow Decoder outputs optical flow from time step to , and a composition mask. We use the predicted flow to warp the previous output onto the current time step . We then blend the two frames into one by the estimated composition mask to obtain the final output of our whole network:
[TABLE]
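The blending step can be illustrated with a per-pixel convex combination. In this sketch a mask value of 1 selects the warped previous output and 0 selects the raw decoder output; which side the mask favours is an assumption, as are the names.

```python
def blend(raw_output, warped_prev, comp_mask):
    """Blend the raw output with the warped previous output.

    comp_mask values lie in [0, 1]; 1 keeps the warped previous frame and
    0 keeps the raw decoder output (assumed convention).
    """
    return [(1.0 - m) * r + m * w
            for r, w, m in zip(raw_output, warped_prev, comp_mask)]


# Three pixels: keep raw, keep warped, and an even mix of the two.
final = blend([1.0, 2.0, 1.0], [3.0, 4.0, 3.0], [0.0, 1.0, 0.5])
```

A soft mask lets the network reuse the previous frame where the flow is reliable and fall back to freshly synthesized content elsewhere.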
3.2 Objective functions
Homography estimation. We train the homography network using this objective function:
[TABLE]
The first part is introduced by [21], and the second part is a direct L1 loss on the transformation parameters. $\mathcal{G}_{\theta}$ and $\mathcal{G}_{\theta_{gt}}$ correspond to the bilinear sampling grids that use the predicted parameters $\theta$ and the ground-truth parameters $\theta_{gt}$, respectively. Here, $n$ denotes the total number of sampling coordinates.
Note that the homography estimation network is trained independently from the video inpainting network (i.e., the inpainting network and the optical flow estimation network). After training, we freeze the network's parameters and use it as a global affine transformer.
Video inpainting. The objective function is designed to capture pixel-wise reconstruction accuracy, perceptual similarity, and temporal consistency.
The pixel-wise reconstruction loss is defined as follows:
[TABLE]
where $t$ indexes over the number of recurrences, $M_t$ is the binary mask, $\hat{Y}_t$ is the model output, $Y_t$ is the ground truth, and $\odot$ indicates element-wise multiplication.
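One plausible reading of this loss is a masked L1 difference summed over the recurrence steps. The sketch below implements that reading on flattened frames; the per-frame mean normalization and the restriction to masked pixels only are assumptions, since the exact form of the equation is not available here.

```python
def masked_l1(outputs, targets, masks):
    """Sum over time steps of the mean masked absolute error.

    outputs, targets, masks: lists (one entry per recurrence step) of
    flattened frames. Normalizing by the frame size is an assumption.
    """
    total = 0.0
    for out, gt, mask in zip(outputs, targets, masks):
        total += sum(m * abs(o - g) for o, g, m in zip(out, gt, mask)) / len(out)
    return total


# One step, two pixels, only the first pixel lies inside the mask.
loss = masked_l1([[1.0, 2.0]], [[0.0, 0.0]], [[1, 0]])
```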
To ensure perceptual similarity between the predicted output and the ground truth, we adopt both the image GAN loss and the video GAN loss introduced by [24].
For the temporal consistency, we use flow loss and warping loss which are defined as:
[TABLE]
where $W_{t \Rightarrow t-1}$ is the pseudo-groundtruth backward flow between the target frames $Y_t$ and $Y_{t-1}$, extracted by FlowNet2 [10], and $O_{t \Rightarrow t-1}$ is the binary occlusion mask [13]. Note that we use groundtruth target frames in the warping operation since the synthesizing ability is imperfect during training. We employ a curriculum learning scheme that increases the number of recurrences, $t$, by 6 every 5 epochs, up to a maximum of 24.
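The curriculum on the number of recurrences can be written as a small schedule function. The increment of 6, the 5-epoch period, and the cap of 24 come from the text; the starting value of 6 is an assumption, since the initial number of recurrences is not stated.

```python
def num_recurrences(epoch, start=6, step=6, every=5, cap=24):
    """Number of recurrence steps used at a given (0-based) epoch.

    `start` is an assumed initial value; step, period, and cap follow the text.
    """
    return min(cap, start + step * (epoch // every))


# Under these assumptions the schedule reads 6, 12, 18, 24, 24, ...
schedule = [num_recurrences(e) for e in range(0, 25, 5)]
```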
The total loss is the weighted summation of all the loss functions:
[TABLE]
3.3 Training
Homography estimation. We generate synthetic data using the Places2 image dataset [30]. Given a random image, we generate its counterpart by applying an arbitrary transformation to it. This gives us great flexibility to gather as much training data as needed, for any 2D geometric transformation. To simulate diverse hole shapes and sizes, we use the irregular mask dataset [16], which consists of random streaks and holes of arbitrary shapes. During training, we apply random affine transformations (e.g., translation, rotation, scaling, shearing) to the mask. All images are resized to pixels for training.
Video inpainting. We employ a two-stage training scheme. 1) We first train the video inpainter without the alignment and refinement stages to focus on learning a pure synthesis ability. To synthesize the training data, we follow the same protocol mentioned above. 2) We then add the previously excluded stages along with the recurrence stream to the model. We fine-tune the whole model using videos in the YouTube-VOS dataset [27]. It is a large-scale video segmentation dataset containing 4000+ YouTube videos with 70+ various moving objects. Since the most realistic appearance and motion can be obtained from the foreground segmentation masks, we use them to synthesize the training video data. All video frames are resized to pixels for training.
3.4 Testing
We use the DAVIS dataset [19, 20], which is widely used for video inpainting benchmarking. The videos are very challenging since they include dynamic scenes, complex camera movements, motion blur effects, and large occlusions. We obtain the inpainting mask by dilating the ground truth segmentation mask. Our method processes frames recursively in a sliding window manner.
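Mask dilation, as used to turn a segmentation mask into an inpainting mask, can be sketched with a square structuring element. The radius and the square shape are hypothetical choices for illustration; the text does not state the dilation size.

```python
def dilate(mask, radius=1):
    """Binary dilation of a 2D 0/1 mask with a (2*radius+1)^2 square window.

    The radius is a hypothetical parameter, not a value from the paper.
    """
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Mark every neighbour inside the window, clipped to the image.
                for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[yy][xx] = 1
    return out
```

Growing the mask slightly beyond the object boundary helps cover pixels that the segmentation misses at object edges.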
3.5 Implementation Details
Our model is implemented using PyTorch v0.4, cuDNN v7.0, and CUDA v9.0. It runs on hardware with an Intel(R) Xeon(R) (2.10GHz) CPU and an NVIDIA GTX 1080 Ti GPU. The model runs at 15 fps on a GPU for frames of pixels. We use the Adam optimizer with $(\beta_1, \beta_2)$ = (0.9, 0.999). The learning rate starts at 2e-4 and is divided by 10 every 5 epochs. We train our model from scratch. The homography training and the video inpainting training each take about 3 days using eight NVIDIA GTX 1080 Ti GPUs.
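The step decay of the learning rate stated above can be expressed as a short function. The name `learning_rate` and the 0-based epoch convention are assumptions; in PyTorch the same schedule would typically be obtained with a step scheduler such as `torch.optim.lr_scheduler.StepLR` with `step_size=5` and `gamma=0.1`.

```python
def learning_rate(epoch, base_lr=2e-4, decay=10.0, every=5):
    """Learning rate at a given 0-based epoch under step decay."""
    return base_lr / (decay ** (epoch // every))


# The first epochs of each 5-epoch block under this schedule.
rates = [learning_rate(e) for e in (0, 5, 10)]
```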
4 Experiments
We evaluate our method both quantitatively and qualitatively. We compare our approach with state-of-the-art methods in three representative streams of study: deep image inpainting [29], deep video inpainting [12], and optimization-based video inpainting [8]. Two metrics are mainly used for the evaluation. The first is the Fréchet Inception Distance (FID) [24], extended to videos to measure perceptual quality in the spatio-temporal dimension. The second is the flow warping error between frames, which measures the temporal consistency of video results. We also conduct extensive ablation studies to validate the proposed design choices.
4.1 User Study on Video Object Removal
We perform a user study to evaluate the visual quality of inpainted videos. We use 20 videos from the DAVIS dataset and compare our method with the strong baselines [8, 12]. A total of 25 users participated in this study. During each test, a user is shown video inpainting results from two different approaches, together with the input target video. We ask the user to check both image quality and temporal coherency and to choose the better one. The users are allowed to play the videos multiple times so that they have enough time to distinguish the differences and make a careful judgment. We report the ratio at which each method's outputs are preferred in Table 1a. The human subjects consider our results comparable to those of [8], and of much higher quality than those of [12]. Note that our method runs faster than both approaches (see Table 1b). Some example results are shown in Fig. 3.
4.2 Quantitative comparison
We further compare our method with the baselines [29, 8, 12] using both the FID score and the warping error. Since we need ground truth videos for this experiment, we composite target videos by overlaying foreground mask sequences extracted from other videos. To measure the FID score, we take 20 videos in the DAVIS dataset. For each video, we choose a different video out of the other 19 videos to make a mask sequence. We use the first 64 frames of both the input and mask videos. To measure the flow warping errors, we use the Sintel dataset since it provides ground-truth optical flows. We take 32 frames each from 21 videos in the Sintel dataset and randomly select 21 videos of length 32+ from the DAVIS dataset to create the corresponding mask sequences. For both metrics, we run five trials and average the scores over the videos and trials. We summarize the results in Table 1c and Table 1d. We observe a similar tendency to the user study result.
4.3 Ablation studies
We run an extensive ablation study to demonstrate the effectiveness of different components of our method. We measure FID score and warping error following the same protocol as in Sec. 4.2. The results are summarized in Table 2.
Network design choices. The main components of our network design are the two-stage feature aggregation part together with the temporal propagation part. First, we investigate the importance of each stage in the align-and-attend network. If we drop the alignment stage from the pathway, the refinement stage alone has to pick up valid reference patches to fill in the holes. However, it is difficult for the non-local module to match the zero patches to any reference patches without any priors. If we drop the refinement stage, the real video dynamics (e.g., small, non-rigid motions) cannot be modeled and the resulting videos lack such fine details. To isolate the effect of temporal propagation, we drop the flow estimator pathway. Without the recurrence, temporal consistency is no longer well supported. If we remove both the multi-frame aggregation and the propagation parts, our network degenerates to a single image inpainting network. As shown in Table 2a, all proposed components have complementary effects, and the best results are obtained when all components are used.
Masked matching in non-local attention module. In our non-local attention module, the coarsely completed region in the target hole is matched with the non-hole area in the reference frames. By doing so, the regions that still remain as holes are ignored during the refinement matching; only the newly generated patches are touched during the refinement stage. To see the effectiveness of this matching method, we show the results when there is no such constraint (i.e., all patches in the target frame are matched with all patches in the reference frames). As shown in Table 2b, our proposed matching method is indeed effective, resulting in better video quality.
Recurrence stream. We report the flow warping errors to compare the temporal consistencies of video results before and after adding the recurrence stream (flow estimator pathway). As shown in Table 2c, we observe the warping error is significantly reduced when there is the recurrence. This implies that propagating the previous output significantly improves the temporal consistency of videos. This is also consistent with the recent findings in [11, 12].
5 Conclusion
In this paper, we present a novel deep network for video inpainting. Our model fills in a target hole by referring to multiple reference frames in a coarse-to-fine manner. First, we propose homography-based alignment between the reference and target frames to roughly inpaint the missing contents. Second, a non-local attention module refines the previously generated regions. Both stages provide a large spatial-temporal window size that has not been achieved by existing flow-based methods. We validate the effectiveness of our approach in real object removal scenarios.
References
- [1] Coloma Ballester, Marcelo Bertalmio, Vicent Caselles, Guillermo Sapiro, and Joan Verdera. Filling-in by joint interpolation of vector fields and gray levels. In IEEE Trans. Image Processing (TIP), volume 10, pages 1200–1211. IEEE, 2001.
- [2] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424, 2000.
- [3] Marcelo Bertalmio, Luminita Vese, Guillermo Sapiro, and Stanley Osher. Simultaneous structure and texture image inpainting. In IEEE Trans. Image Processing (TIP), volume 12, pages 882–889. IEEE, 2003.
- [4] Soheil Darabi, Eli Shechtman, Connelly Barnes, Dan B Goldman, and Pradeep Sen. Image melding: Combining inconsistent images using patch-based synthesis. In ACM Trans. on Graph. (ToG), volume 31, pages 82–1. Citeseer, 2012.
- [5] Alexei A Efros and William T Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 341–346. ACM, 2001.
- [6] Miguel Granados, Kwang In Kim, James Tompkin, Jan Kautz, and Christian Theobalt. Background inpainting for videos with dynamic objects and a free-moving camera. In Proc. of European Conf. on Computer Vision (ECCV), pages 682–695. Springer, 2012.
- [7] Miguel Granados, James Tompkin, K Kim, Oliver Grau, Jan Kautz, and Christian Theobalt. How not to be seen—object removal from videos of crowded scenes. In Computer Graphics Forum, volume 31, pages 219–228. Wiley Online Library, 2012.
- [8] Jia-Bin Huang, Sing Bing Kang, Narendra Ahuja, and Johannes Kopf. Temporally coherent completion of dynamic video. ACM Transactions on Graphics (TOG), 35(6):196, 2016.
