Learn to Model Motion from Blurry Footages

Wenbin Li; Da Chen; Zhihan Lv; Yan Yan; Darren Cosker

arXiv:1704.05817·cs.CV·April 20, 2017


TL;DR

This paper introduces a hybrid CNN and optical flow framework that effectively models motion from blurry footage by capturing blur features and jointly estimating deblurring and motion fields, trained end-to-end on synthetic data.

Contribution

It presents a novel learnable directional filtering layer within a CNN integrated into an iterative optical flow framework for motion estimation from blurry videos.

Findings

01. Achieves competitive accuracy against state-of-the-art methods.

02. Effectively models both deblurring and motion estimation.

03. End-to-end training on synthetic data enhances performance.

Abstract

It is difficult to recover the motion field from real-world footage that contains a mixture of camera shake and other photometric effects. In this paper we propose a hybrid framework that interleaves a Convolutional Neural Network (CNN) with a traditional optical flow energy. We first construct a CNN architecture using a novel learnable directional filtering layer. This layer encodes the angle and distance similarity matrix between blur and camera motion, which enhances the blur features of camera-shake footage. The proposed CNNs are then integrated into an iterative optical flow framework, which makes it possible to model and solve the blind deconvolution and optical flow estimation problems simultaneously. Our framework is trained end-to-end on a synthetic dataset and yields competitive precision and performance against state-of-the-art approaches.

Equations

$$I = k \ast \ell + n$$

$$\underset{k,\,\ell}{\operatorname{argmin}} \left\{ \| I - k \ast \ell \| + \rho(k) \right\}$$

$$E(\textbf{w}) = E_{data}(\textbf{w}) + \gamma E_{sm}(\textbf{w})$$

$$B_{1} = k_{2} \ast I_{1} \approx k_{2} \ast k_{1} \ast \ell_{1}$$

$$B_{2} = k_{1} \ast I_{2} \approx k_{1} \ast k_{2} \ast \ell_{2}$$

$$E_{data}(\textbf{w}) = \underbrace{\int_{\Omega} \phi\left( \| B_{2}(\textbf{x}+\textbf{w}) - B_{1}(\textbf{x}) \|^{2} \right) d\textbf{x}}_{\text{Blur Brightness Constancy}} + \alpha \underbrace{\int_{\Omega} \phi\left( \| \nabla B_{2}(\textbf{x}+\textbf{w}) - \nabla B_{1}(\textbf{x}) \|^{2} \right) d\textbf{x}}_{\text{Blur Gradient Constancy}}$$

$$E_{sm}(\textbf{w}) = \int_{\Omega} \phi\left( \| \nabla u \|^{2} + \| \nabla v \|^{2} \right) d\textbf{x}$$

$$f(\omega) \circ I(\textbf{x}) = \int_{\textbf{x}} \int_{t} \kappa\, G(t)\, I(\textbf{x} + t\omega)\, d\textbf{x}\, dt$$

$$I_{i} = \varphi_{i}\, f(i\pi/16) \circ I$$

$$\widehat{I}_{i} = \sum_{j} \lambda_{ij}\, \texttt{tanh}(\mathcal{C}_{j} \ast I)$$

$$\widehat{\ell}_{i} = \sum_{j} \delta_{ij}\, \texttt{tanh}(\mathcal{C}_{j} \ast I)$$

$$\widehat{k} = \underset{k}{\operatorname{argmin}} \sum_{(I_{*},\,\ell_{*})} \tau_{*} \| I_{*} - k \ast \ell_{*} \|^{2} + \beta_{k} \| k \|^{2}$$

$$(I_{*}, \ell_{*}) \in \left\{ (\partial_{x} \widehat{I}, \partial_{x} \widehat{\ell}),\ (\partial_{y} \widehat{I}, \partial_{y} \widehat{\ell}),\ (\partial_{xx} \widehat{I}, \partial_{xx} \widehat{\ell}),\ (\partial_{yy} \widehat{I}, \partial_{yy} \widehat{\ell}),\ (\partial_{xy} \widehat{I}, (\partial_{x}\partial_{y} + \partial_{y}\partial_{x}) \widehat{\ell} / 2) \right\}$$

$$\ell = \underset{\ell}{\operatorname{argmin}} \sum_{i} \| I_{i} - \widehat{k} \ast \ell \|^{2} + \beta_{\ell} \| I_{i} \|^{2}$$

$$\begin{array}{ll} B_{x} = \partial_{x} B_{2}(\textbf{x}+\textbf{w}) & B_{yy} = \partial_{yy} B_{2}(\textbf{x}+\textbf{w}) \\ B_{y} = \partial_{y} B_{2}(\textbf{x}+\textbf{w}) & B_{z} = B_{2}(\textbf{x}+\textbf{w}) - B_{1}(\textbf{x}) \\ B_{xx} = \partial_{xx} B_{2}(\textbf{x}+\textbf{w}) & B_{xz} = \partial_{x} B_{2}(\textbf{x}+\textbf{w}) - \partial_{x} B_{1}(\textbf{x}) \\ B_{xy} = \partial_{xy} B_{2}(\textbf{x}+\textbf{w}) & B_{yz} = \partial_{y} B_{2}(\textbf{x}+\textbf{w}) - \partial_{y} B_{1}(\textbf{x}) \end{array}$$

$$\begin{aligned} &(\phi^{\prime})_{B}^{i} \cdot \big\{ B_{x}^{i} (B_{z}^{i} + B_{x}^{i} du^{i} + B_{y}^{i} dv^{i}) \\ &\quad + \alpha B_{xx}^{i} (B_{xz}^{i} + B_{xx}^{i} du^{i} + B_{xy}^{i} dv^{i}) \\ &\quad + \alpha B_{xy}^{i} (B_{yz}^{i} + B_{xy}^{i} du^{i} + B_{yy}^{i} dv^{i}) \big\} \\ &\quad - \gamma (\phi^{\prime})_{S}^{i} \cdot \nabla (u^{i} + du^{i}) = 0 \end{aligned}$$

$$\begin{aligned} &(\phi^{\prime})_{B}^{i} \cdot \big\{ B_{y}^{i} (B_{z}^{i} + B_{x}^{i} du^{i} + B_{y}^{i} dv^{i}) \\ &\quad + \alpha B_{yy}^{i} (B_{yz}^{i} + B_{xy}^{i} du^{i} + B_{yy}^{i} dv^{i}) \\ &\quad + \alpha B_{xy}^{i} (B_{xz}^{i} + B_{xx}^{i} du^{i} + B_{xy}^{i} dv^{i}) \big\} \\ &\quad - \gamma (\phi^{\prime})_{S}^{i} \cdot \nabla (v^{i} + dv^{i}) = 0 \end{aligned}$$

$$\begin{aligned} (\phi^{\prime})_{B}^{i} = \phi^{\prime} \big\{ &(B_{z}^{i} + B_{x}^{i} du^{i} + B_{y}^{i} dv^{i})^{2} \\ &+ \alpha (B_{xz}^{i} + B_{xx}^{i} du^{i} + B_{xy}^{i} dv^{i})^{2} \\ &+ \alpha (B_{yz}^{i} + B_{xy}^{i} du^{i} + B_{yy}^{i} dv^{i})^{2} \big\} \end{aligned}$$

$$(\phi^{\prime})_{S}^{i} = \phi^{\prime} \big\{ \| \nabla (u^{i} + du^{i}) \|^{2} + \| \nabla (v^{i} + dv^{i}) \|^{2} \big\}$$


Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Image and Video Stabilization

Full text

Learn to Model Motion from Blurry Footages

Wenbin Li

Da Chen

Zhihan Lv

Yan Yan

Darren Cosker

Department of Computing, Imperial College London, UK

Department of Computer Science, University College London, UK

Department of Computer Science, University of Bath, UK

Department of Information Engineering and Computer Science (DISI), University of Trento, Italy

Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA), University of Bath, UK


Keywords: Optical Flow, Convolutional Neural Network (CNN), Video/Image Deblurring, Directional Filtering

Journal: Pattern Recognition

1 Introduction

In image space, the information conveyed by the dynamic behavior of an object of interest, or by the motion of the camera itself, is decisive for interpreting natural phenomena. Dense motion estimation, in particular optical flow between a pair of consecutive images, is the lowest-level characterization of this information: it estimates a dense field giving the displacement of each pixel. It has become one of the most active fields in computer vision because this characterization can be embedded into a large number of higher-level computer vision tasks and application domains, such as tracking [1, 2, 3], 3D reconstruction [4], segmentation, and virtual reality, augmented reality and post-production in general [5, 6].

A typical optical flow pipeline relies on solving a brightness energy with the assistance of patch detection, matching, constrained optimization and interpolation. For many state-of-the-art approaches, even though the precision has reached a reasonable level, the related applications are still limited by difficult photometric effects and slow runtime. In recent years, deep Convolutional Neural Networks (CNNs) have developed rapidly, providing hidden features and end-to-end knowledge representation for many central problems, e.g. motion and texture style. Such knowledge representation can improve both the robustness and the speed of the typical optical flow pipeline.

Camera-shake blur is a common photometric effect in real-world footage, often caused by fast camera motion under low light. This effect may produce spatially invariant blur across the pixels, and brings extra difficulty to typical optical flow estimation because the basic brightness constancy assumption [7] is violated. However, the blur in everyday video footage (24 FPS) can be characterized directionally [8]. This observation provides an extra prior that enhances camera-shake deblurring [9] and helps recover precise optical flow from blurry images. Such a directional prior requires prior knowledge of the camera's motion direction, which can be obtained from an external sensor [8].

1.1 Contributions

In this paper, we study the problem of recovering accurate optical flow from frames of real-world video footage affected by camera-shake blur. The main idea is to learn directional filters that encode the angle and distance similarity between blur and camera motion. These filters are then applied to enhance the optical flow estimation. Our proposed method relies only on the input images and needs no other information, e.g. ground truth camera motion or a blur prior.

In overview, we propose a novel hybrid approach: (1) we construct a CNN architecture using a learnable directional filtering layer. Our network extracts blur and latent features from a blurry image, and further recovers the blur kernel in an iterative deconvolution scheme (Sec. 4); (2) we integrate our network into a variational optical flow energy, which is optimized within a hybrid coarse-to-fine framework (Sec. 5).

In the evaluation (Sec. 6), we quantitatively compare our method to four baselines on synthetic Ground Truth (GT) sequences. These baselines include two blur-oriented optical flow approaches and two other publicly available state-of-the-art methods. We also give a qualitative comparison on real-world blurry footage.

2 Related Work

In this section, we briefly discuss related work in image deblurring and optical flow estimation.

2.1 Image Deblurring

Image blur is a common photometric effect in everyday capture. It is often caused by fast camera movement under low light. Such global blur can be formulated as follows:

$$I = k \ast \ell + n$$

where an observed blurred image $I$ is represented as the convolution of the latent sharp image $\ell$ with a spatially invariant blur kernel $k$ (the Point Spread Function), plus spatial noise $n$. To solve for $k$ and $\ell$, a blind deconvolution is normally performed on $I$:

$$\underset{k,\,\ell}{\operatorname{argmin}} \left\{ \| I - k \ast \ell \| + \rho(k) \right\}$$

where $\rho$ represents a regularization that penalizes spatial smoothness with a sparsity prior [10]. To solve this ill-posed problem, many approaches rely on additional priors on the properties of observed images [11, 12, 13, 14, 15, 16, 17, 18]. Pan et al. [13], for example, propose a blind deconvolution method that exploits the dark channel [19], based on the observation that dark pixels in the observed image are normally averaged with neighboring pixels along the blur. Krishnan et al. [11] introduce a novel scale-invariant regularizer that generates a more stable kernel by fixing the attenuation of high frequencies.
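The blur model at the start of this subsection, $I = k \ast \ell + n$, can be illustrated with a few lines of plain Python. This is only a minimal sketch: the function names `convolve2d_same` and `blur`, the zero-padded border handling, and the box kernel are assumptions made for illustration and are not taken from the paper.

```python
import random

def convolve2d_same(image, kernel):
    """Zero-padded 2D convolution; the output has the same size as `image`."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # Convolution flips the kernel relative to correlation.
                    yy, xx = y + ch - i, x + cw - j
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out

def blur(latent, kernel, noise_sigma=0.0, seed=0):
    """Form I = k * l + n, where n is optional Gaussian noise."""
    rng = random.Random(seed)
    blurred = convolve2d_same(latent, kernel)
    return [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in blurred]

# A single bright pixel blurred by a horizontal 1x3 box kernel (illustrative).
latent = [[0.0] * 5 for _ in range(5)]
latent[2][2] = 1.0
kernel = [[1 / 3, 1 / 3, 1 / 3]]
observed = blur(latent, kernel)
```

With a noise-free point source, the observed row is simply a copy of the kernel, which is why the kernel is also called the Point Spread Function. Blind deconvolution has to invert this process when both `kernel` and `latent` are unknown.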

With efficient inference in mind, several algorithms [10, 20, 9, 21] have also been proposed to solve the deblurring problem. Cho and Lee [10] adopt a predicted edge map as a prior and solve the blind deconvolution energy within a coarse-to-fine framework. Xu et al. [20], however, observe that salient edges do not always help the blur kernel search; such edges can greatly increase the blur ambiguity in many common scenes. Hence, instead of using an edge map, they propose an automatic gradient selection scheme that eliminates the “noisy” edges for kernel initialization. Furthermore, Zhong et al. [9] introduce an approach that reduces noise using a pre-filtering process. This process preserves the useful image information by reducing the noise along a specific direction.

Both the methods based on natural image properties and those based on efficient inference can provide highly accurate deblurring results for general spatially invariant camera-shake blur. However, these methods often struggle with spatially variant blur. A handful of approaches have been proposed to solve this problem [22, 23, 24, 25, 26]. Gupta et al. [22] propose a Motion Density Function to represent the camera motion, which is then used to recover the spatially varying blur kernel. Hu et al. [25] consider the varying depth of the scene, whereas most deblurring methods assume a constant depth for simplicity. They apply a unified layer-based model to jointly estimate the depth and the deblurring result from the underlying geometric relationship caused by camera motion.

Since all the methods mentioned above have their own specialties and limitations, there is no general solution for images blurred by mixed sources, i.e. a mixture of fast camera movement, object movement and scene depth variance. In this case, the image blur is hard to represent with a global model. With the development of Convolutional Neural Networks (CNNs), some CNN-based deblurring methods have been proposed to solve this problem. Hradiš et al. [27] apply a CNN to restore blurred text documents, which restricts the method to highly structured data. Xu et al. [28] propose a more general deblurring method: they design a neural network that is guided by traditional deconvolution schemes.

The methods mentioned above usually take a single blurred image as input. There are also hardware-assisted methods that aim to improve the precision and performance of deblurring [29, 30, 31, 32]. Levin et al. [29] propose a uniform method using known camera arc motion. The uniformly deblurred image can be estimated by controlling the camera movement along a parabolic arc. As an extension of this work, Joshi et al. [30] propose to estimate the acceleration and angular velocity of the camera with inertial sensors, i.e. gyroscopes and accelerometers. Instead of highly accurate sensors, Hu et al. [32] introduce a deblurring approach using smartphone inertial sensors. Methods with extra camera motion information often yield higher performance than methods that rely only on a single blurred image as input. However, they require a complex camera setup and precise calibration.

2.2 Optical Flow

The dense motion estimation problem, in particular optical flow, has been widely studied because it can be applied to many computer vision applications, e.g. video segmentation [33], recognition [34] and virtual reality [35]. Many estimation methods have obtained impressive reliability and accuracy on the Middlebury [36] and Sintel [37] benchmarks. Most of these works are based on the pioneering optical flow method proposed by Horn and Schunck [7]. They combine a data term and a smoothness term into an energy function, where the former assumes the constancy of a certain image feature, typically the Brightness Constancy Constraint (BCC), and the latter controls how the motion field varies (such as the Motion Smoothness Constraint). This energy function is then optimized across the entire image to obtain the global motion field. The original formulation is generally applicable but often limited by challenges such as large displacement, non-rigid motion, motion boundary discontinuities and motion blur [37]. Numerous extensions have been proposed to overcome these challenges by introducing additional constraints and more advanced optimization procedures [38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48]. Brox et al. [39] bring a gradient constancy assumption into the data term to reduce the dependency on the BCC, and a discontinuity-preserving spatio-temporal smoothness constraint to deal with motion discontinuities. Xu et al. [41] propose a novel extended coarse-to-fine (EC2F) refinement framework that takes advantage of feature matching. Li et al. [42] propose a Laplacian mesh energy to handle non-rigid deformation in the scene.

Moreover, neural network based methods have recently become popular. Revaud et al. [49] propose an edge-preserving interpolation based on a sparse deep convolutional matching result. The sparse-to-dense interpolation result is then used to initialize the optimization process that yields the final motion field. However, this method relies strongly on the quality of the sparse matching, whose parameters are set manually. Dosovitskiy et al. [50] propose an automatic approach for matching and interpolation. Guided by a correlation layer, their network can better predict the flow used to initialize the refinement. Furthermore, Teney and Hebert [51] introduce a stand-alone CNN structure for motion estimation that requires less training data. The result, however, is inferior to state-of-the-art methods.

The presence of blur in the scene easily causes traditional optical flow methods to fail because it violates the brightness constancy assumption. Only a few approaches have been introduced to address this problem [52, 53, 54, 55, 56]. Portz et al. [52] treat the appearance of each input frame as a parameterized function combining pixel motion and blur motion. The motion cues are then integrated into the data term of the energy function. However, this favors a smooth motion field and usually fails at motion boundaries. To solve this problem, Wulff and Black [54] treat the motion blur as a function of the layer motion and segmentation determined by a generative model. The optimization then minimizes the pixel error between the input blurred images and the synthesized image. Tu et al. [55] modify the data term using a matching method based on blur detection. Their approach is intended to improve the flow regularization at motion boundaries. Li et al. [57] embed an additional camera motion channel into a hybrid framework in order to obtain the deblurring and motion estimation results iteratively. Their method requires a physical motion tracker to obtain the ground truth motion of the moving camera. This motion information serves as a hard constraint in the image deblurring step. In addition, their method needs careful manual tuning for different sequences, e.g. the kernel size and the number of levels of the image pyramid.

In summary, current methods have particular difficulty estimating optical flow from blurry images because the blur may break the photometric properties and further mislead the common regularization. Our proposed method represents the blurred image using CNN features, which are then used to recover optical flow within a fast optimization framework.

In the following sections, we first discuss our pipeline for recovering optical flow from blurred footage (Sec. 3). We then introduce our main contributions: a novel CNN-based deblurring framework (Sec. 4) and a hybrid optical flow framework (Sec. 5), followed by the evaluation (Sec. 6).

3 Recover Motion Field from Blurred Footage

The typical optical flow framework considers a pair of adjacent images, and follows the Brightness Constancy assumption ($E_{data}$) and a global smoothness constraint ($E_{sm}$), as follows:

$$E(\textbf{w}) = E_{data}(\textbf{w}) + \gamma E_{sm}(\textbf{w})$$

where $I_{1}(\textbf{x})$ and $I_{2}(\textbf{x})$ denote the current frame and its successor respectively. Each observed image can also be represented by its corresponding latent image and blur kernel, $I_{*}=k_{*}\ast\ell_{*}$, $\{I_{*},\ell_{*}:\Omega\subset\mathbb{R}^{3}\}$. The optical flow field, denoted by $\textbf{w}=(u,v)^{T}$, is obtained by minimizing this functional.

However, given such a pair of blurred images, the blur may damage the image structure and violate the basic Brightness Constancy assumption of optical flow estimation. The resulting large number of outliers leads to unpredictable errors in the energy optimization. The straightforward solution is to remove the blur before performing the optical flow estimation. The deblurring process may sharpen the images, but it still permanently changes the pixel intensities and introduces unpredictable artifacts. The alternative is to match the non-uniform blur [57, 52] between the input images:

$$B_{1} = k_{2} \ast I_{1} \approx k_{2} \ast k_{1} \ast \ell_{1}$$

$$B_{2} = k_{1} \ast I_{2} \approx k_{1} \ast k_{2} \ast \ell_{2}$$

which gives the uniformly blurred images $B_{1}$ and $B_{2}$ that are used in the Blur Brightness and Blur Gradient Constancy terms:

$$E_{data}(\textbf{w}) = \underbrace{\int_{\Omega} \phi\left( \| B_{2}(\textbf{x}+\textbf{w}) - B_{1}(\textbf{x}) \|^{2} \right) d\textbf{x}}_{\text{Blur Brightness Constancy}} + \alpha \underbrace{\int_{\Omega} \phi\left( \| \nabla B_{2}(\textbf{x}+\textbf{w}) - \nabla B_{1}(\textbf{x}) \|^{2} \right) d\textbf{x}}_{\text{Blur Gradient Constancy}}$$

where $\nabla=(\partial_{x},\partial_{y})^{T}$ denotes the spatial gradient and $\alpha\in[0,1]$ is a linear weight. The smoothness term regularizes the global flow variation as follows:

$$E_{sm}(\textbf{w}) = \int_{\Omega} \phi\left( \| \nabla u \|^{2} + \| \nabla v \|^{2} \right) d\textbf{x}$$

where the Lorentzian regularization $\phi(s)=\log(1+s^{2}/2\epsilon^{2})$ is applied to preserve motion boundaries. The non-uniform blur matching is intended to protect the color properties of the images, and to keep color correlation and consistency across the input images. Table 2 quantifies how significantly the blur matching improves the flow precision.
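The blur matching step works because convolution is commutative: blurring each frame with the other frame's kernel leaves both frames with the same combined blur. The following 1D sketch illustrates this under simplifying assumptions that are not from the paper: a static scene (both frames share one latent signal), no noise, and hypothetical kernels `k1` and `k2`.

```python
def convolve_full(signal, kernel):
    """Full 1D discrete convolution of two lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

# Hypothetical sharp signal and two different blur kernels (each sums to 1).
latent = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
k1 = [0.5, 0.5]
k2 = [0.25, 0.5, 0.25]

# Static scene for simplicity: both frames observe the same latent signal.
I1 = convolve_full(latent, k1)   # frame 1, blurred by k1
I2 = convolve_full(latent, k2)   # frame 2, blurred by k2

# Blur matching: apply the *other* frame's kernel to each observation.
B1 = convolve_full(I1, k2)       # k2 * k1 * latent
B2 = convolve_full(I2, k1)       # k1 * k2 * latent

mismatch_before = max(abs(a - b) for a, b in zip(I1, I2))
mismatch_after = max(abs(a - b) for a, b in zip(B1, B2))
```

Before matching, the two observations differ even though the scene is identical, which is exactly the violation of brightness constancy described above. After matching, `B1` and `B2` agree, so a constancy term defined on them is meaningful again. With real data the agreement is only approximate because of noise and kernel estimation error, hence the $\approx$ in the equations.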

In the following sections, we present our CNN-based approach, which consists of two stacked modules: (1) a layered network for blind deconvolution; (2) an iterative optical flow framework.

4 A Layered Network for Blind Deconvolution

As shown in Fig. 1, we propose an $n$-iteration coarse-to-fine blind deconvolution module built around a trainable convolutional neural network. In each iteration, the input images go through the following processes:

4.1 Directional Filtering

The blur in everyday photography may be highly nonlinear and hard to predict. It can, however, be parameterized in a near-linear form if it comes from everyday video footage captured at an ordinary frame rate (24 Hz). In this context, directional filters [9, 57] can be effective for regularizing the blur within the deconvolution operation. A common form reads:

$$f(\omega) \circ I(\textbf{x}) = \int_{\textbf{x}} \int_{t} \kappa\, G(t)\, I(\textbf{x} + t\omega)\, d\textbf{x}\, dt$$

where $\textbf{x}=(x,y)^{T}$ represents a pixel location, $G$ denotes a Gaussian kernel and $\omega=(\cos\theta,\sin\theta)^{T}$ controls the filtering direction. The filter is normalized by $\kappa=\left(\int G(t)dt\right)^{-1}$. Such a directional filter removes general noise but does not affect the signal along the orthogonal direction. Given a ground truth blur direction $\omega$, the filtered image $\widetilde{I}^{\omega}=f(\omega+\pi/2)\circ I$ may destroy the color properties but is intended to enhance the useful blur information.
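A discrete version of the directional filter can be sketched as a 1D Gaussian average taken along the direction $\omega$. The code below is an illustration only: the function name `directional_filter`, nearest-neighbour sampling, clamped borders, and the values of `sigma` and `radius` are all assumptions, not details given in the paper.

```python
import math

def directional_filter(image, theta, sigma=1.5, radius=4):
    """Smooth `image` along direction theta with a 1D Gaussian profile.

    Discretizes kappa * sum_t G(t) I(x + t*w) with w = (cos theta, sin theta),
    using nearest-neighbour sampling and clamped image borders.
    """
    h, w = len(image), len(image[0])
    ts = range(-radius, radius + 1)
    weights = [math.exp(-t * t / (2.0 * sigma * sigma)) for t in ts]
    kappa = 1.0 / sum(weights)          # normalization, kappa = (sum G)^-1
    dx, dy = math.cos(theta), math.sin(theta)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for t, g in zip(ts, weights):
                xs = min(max(int(round(x + t * dx)), 0), w - 1)
                ys = min(max(int(round(y + t * dy)), 0), h - 1)
                acc += g * image[ys][xs]
            out[y][x] = kappa * acc
    return out

# Vertical stripes: intensity varies only along x.
stripes = [[float(x % 2) for x in range(9)] for _ in range(9)]
along_y = directional_filter(stripes, math.pi / 2)  # filter along the stripes
along_x = directional_filter(stripes, 0.0)          # filter across the stripes
```

Filtering along the stripes leaves the pattern unchanged, because the variation lies in the orthogonal direction. Filtering across them averages the pattern towards a flat value. This is the behaviour the text relies on: structure orthogonal to the filter direction survives, while variation along it is smoothed away.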

In our network, we propose a novel Directional Filtering (DF) layer, which calculates a new group of image representations by applying a directional filter along different directions. This process aims to remove the spatial noise while preserving the blur information. We therefore model the first filtering layer using weights shared across all locations. Our filters read:

$$I_{i} = \varphi_{i}\, f(i\pi/16) \circ I$$

where $f(\omega)$ denotes the directional filter and $\varphi_{i}$ weights the strength of the filtering in the specific direction. $i$ indexes the sampled directions. In our implementation, we uniformly select $k+1$ directions $i=\{0,1,\cdots,k\}$ within $\pi/2$. After the directional filtering, we construct the deep feature representation for the images.
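The direction sampling can be made concrete with a short sketch. It assumes the $k+1$ directions are spaced evenly over $[0, \pi/2]$ with both end points included, which matches the angle $i\pi/16$ in the filter definition when $k=8$; the function name `sample_directions` is hypothetical.

```python
import math

def sample_directions(k):
    """Return the k + 1 angles i * (pi / 2) / k, i = 0..k, and their unit vectors."""
    angles = [i * (math.pi / 2.0) / k for i in range(k + 1)]
    vectors = [(math.cos(a), math.sin(a)) for a in angles]
    return angles, vectors

# With k = 8 the angular step is pi / 16.
angles, vectors = sample_directions(8)
```

Each angle defines one directional filter $f(\omega_{i})$, and the learnable weight $\varphi_{i}$ then scales the corresponding filtered image.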

4.2 Feature Representation

Similar to Schuler et al. [58], our network does not predict the latent image directly but performs a Feature Representation step, which computes gradient image representations and preliminary estimates for the subsequent kernel and latent image estimation. Our scheme extracts the features from a subset of pixels. The local feature information is then integrated by a global combination. This heuristic strategy can greatly reduce the number of parameters to optimize.

To extract our Feature Representation, we adopt a sub-network with three layers: convolution, nonlinearity and linear combination. We first apply a group of Convolutional Filters (CFs) to the denoised (directionally filtered) images, then transform the values using the tanh function. The resulting features are then combined linearly into a new representation as follows:

$$\widehat{I}_{i} = \sum_{j} \lambda_{ij}\, \texttt{tanh}(\mathcal{C}_{j} \ast I)$$

$$\widehat{\ell}_{i} = \sum_{j} \delta_{ij}\, \texttt{tanh}(\mathcal{C}_{j} \ast I)$$

where $\mathcal{C}_{*}$ denotes a set of Convolutional Filters that are shared between $\widehat{I}_{*}$ and $\widehat{\ell}_{*}$, $\texttt{tanh}(\cdot)$ is the nonlinearity, and $\lambda_{**}$ and $\delta_{**}$ weight the linear combinations for $\widehat{I}_{*}$ and $\widehat{\ell}_{*}$ respectively. Fig. 2 shows the intermediate features for each layer of our network. Note that we stack the tanh and linear combination layers twice after the convolutional layer to obtain a proper level of nonlinearity [58]. In practice, these layers can be stacked further for difficult cases, i.e. strong blur and noise [28].

After these layers, we obtain new feature representations $\widehat{I}_{*}$ and $\widehat{\ell}_{*}$. These feature images are then used to estimate the latent image and kernel.
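The convolution, tanh and linear combination steps of this sub-network can be sketched as follows. This is a minimal illustration with a single tanh and combination stage (the paper stacks them twice) and hand-set difference filters standing in for the learned $\mathcal{C}_{j}$; the names `conv2d_valid`, `feature_representation` and `mix` are assumptions.

```python
import math

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation, as commonly used by CNN convolution layers."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[y + i][x + j]
                 for i in range(kh) for j in range(kw))
             for x in range(w)] for y in range(h)]

def feature_representation(image, filters, mix):
    """Compute out_i = sum_j mix[i][j] * tanh(filters[j] conv image)."""
    hidden = [[[math.tanh(v) for v in row] for row in conv2d_valid(image, c)]
              for c in filters]
    h, w = len(hidden[0]), len(hidden[0][0])
    return [[[sum(m * hidden[j][y][x] for j, m in enumerate(weights))
              for x in range(w)] for y in range(h)]
            for weights in mix]

# Hypothetical 2x2 filters (horizontal and vertical differences); the paper learns them.
d_dx = [[-1.0, 1.0], [0.0, 0.0]]
d_dy = [[-1.0, 0.0], [1.0, 0.0]]
mix = [[1.0, 0.0], [0.5, 0.5]]      # plays the role of lambda (or delta)

# Horizontal ramp image: intensity equals the x coordinate.
image = [[float(x) for x in range(4)] for _ in range(4)]
feats = feature_representation(image, [d_dx, d_dy], mix)
```

Using the same filter bank with two different mixing matrices gives the two outputs $\widehat{I}_{*}$ and $\widehat{\ell}_{*}$, which is how the filters are shared between them.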

4.3 Kernel and Latent Image Estimation

Once we have the current feature representations $\widehat{I}_{*}$ and $\widehat{\ell}_{*}$, a variation of Cho and Lee [10, 57] is adopted for the Kernel and Latent Image Estimation. Our method consists of two steps: (1) Given $\widehat{I}_{*}$ and $\widehat{\ell}_{*}$, we calculate their gradient maps $\Delta=\{\partial_{x},\partial_{y}\}$ along the horizontal and vertical directions, which preserves the high-frequency information, i.e. edges and image structure. (2) The resulting gradient maps $\Delta\widehat{I}$ and $\Delta\widehat{\ell}$ are then used to optimize the energy:

$$\widehat{k} = \underset{k}{\operatorname{argmin}} \sum_{(I_{*},\,\ell_{*})} \tau_{*} \| I_{*} - k \ast \ell_{*} \|^{2} + \beta_{k} \| k \|^{2}$$

$$(I_{*}, \ell_{*}) \in \left\{ (\partial_{x} \widehat{I}, \partial_{x} \widehat{\ell}),\ (\partial_{y} \widehat{I}, \partial_{y} \widehat{\ell}),\ (\partial_{xx} \widehat{I}, \partial_{xx} \widehat{\ell}),\ (\partial_{yy} \widehat{I}, \partial_{yy} \widehat{\ell}),\ (\partial_{xy} \widehat{I}, (\partial_{x}\partial_{y} + \partial_{y}\partial_{x}) \widehat{\ell} / 2) \right\}$$

where $\tau_{*}$ linearly weights the derivatives in each direction while $\beta_{k}$ is the weight for Tikhonov regularization on the kernel. Here both initial $\widehat{I}$ and $\widehat{\ell}$ come from our feature representations. The proposed energy function above is highly nonlinear, and is minimized by following an iterative numerical scheme from [10, 57]. The resulting pre-optimal kernel $\widehat{k}$ can be used to estimate the latent image $\ell$ within a Non-blind Deconvolution process:

$$\ell = \underset{\ell}{\operatorname{argmin}} \sum_{i} \| I_{i} - \widehat{k} \ast \ell \|^{2} + \beta_{\ell} \| I_{i} \|^{2}$$

By minimizing the energy function above, we obtain the latent image $\ell$. Depending on the desired quality of the final deblurring, $\ell$ can be stacked onto the blurred image $I$ along the third dimension. The stacked image is fed into our network and run through the layers iteratively until the final $\ell$ and $k$ are obtained. In this case, all the learned filters of our network have three dimensions.
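As a side note on the kernel step: with the gradient maps held fixed, an energy of the form $\sum \tau_{*}\|I_{*}-k\ast\ell_{*}\|^{2}+\beta_{k}\|k\|^{2}$ is quadratic in $k$. The following is a sketch under an assumption the paper does not make, namely periodic (circular) boundary conditions, in which case the minimizer has a closed form in the Fourier domain. Here $\mathcal{F}$ denotes the 2D discrete Fourier transform and the bar denotes complex conjugation; both symbols are introduced only for this illustration, and the paper itself follows the iterative scheme of [10, 57].

```latex
% Assumption: circular convolution, so Parseval's theorem decouples the
% energy per frequency. Setting the derivative with respect to
% \mathcal{F}(k) to zero at each frequency gives:
\widehat{k} \;=\; \mathcal{F}^{-1}\!\left(
  \frac{\sum_{(I_{*},\ell_{*})} \tau_{*}\,
        \overline{\mathcal{F}(\ell_{*})}\;\mathcal{F}(I_{*})}
       {\sum_{(I_{*},\ell_{*})} \tau_{*}\,
        \bigl|\mathcal{F}(\ell_{*})\bigr|^{2} \;+\; \beta_{k}}
\right)
```

The Tikhonov weight $\beta_{k}$ appears in the denominator, so it prevents division by near-zero values at frequencies where the latent gradients carry little energy.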

In summary, our network only trains the free parameters in Directional Filtering and Feature Representation, and fixes the hyper-parameters in Kernel and Latent Image Estimation. In this case, similar to [58], our learning model is restricted to learning filters with a limited receptive field instead of the full dimensionality of the input blurred images.

4.4 Parameter Training

Similar to traditional approaches, we synthesize pairs of latent and blurred images to train our network. We randomly sample 1,000 images (cropped to 480 $\times$ 480 pixels) from a recent large-scale dataset [59], which contains around 34,000 feature-rich sharp images from three synthetic scenes. To synthesize the associated blurred images, we first adopt 16 kernels from real-world footage [57]. As shown in Fig. 3, these kernels are near linear. For each of the selected kernels, 10 variations are generated by rotating it by $n\pi/20$ radians. This gives 160 distinct kernel variations. We randomly resize each of these kernels to a size between $15\times 15$ and $35\times 35$ pixels, then apply each of them to every selected sharp image. After this process, we obtain 160,000 pairs of training images (1,000 images $\times$ 160 kernels). During training, we randomly add either Gaussian (0.01) or Salt&Pepper (0.15) noise.
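The bookkeeping behind the training set size can be checked with a short sketch. The counts come from the text; everything else is an assumption made for illustration: the range $n = 0, \ldots, 9$ for the rotation index is not stated in the paper, and `linear_kernel` is a hypothetical stand-in that rasterizes an ideal line, whereas the paper uses kernels taken from real footage.

```python
import math

NUM_SHARP_IMAGES = 1000      # images sampled from the dataset
NUM_BASE_KERNELS = 16        # kernels taken from real-world footage
NUM_ROTATIONS = 10           # rotated variations per kernel

# Rotation angles n * pi / 20; the range n = 0..9 is an assumption.
angles = [n * math.pi / 20.0 for n in range(NUM_ROTATIONS)]

kernel_variations = NUM_BASE_KERNELS * NUM_ROTATIONS
training_pairs = NUM_SHARP_IMAGES * kernel_variations

def linear_kernel(size, theta):
    """Rasterize a normalized line-shaped kernel of odd `size` at angle theta.

    Hypothetical stand-in for a near-linear blur kernel.
    """
    c = size // 2
    k = [[0.0] * size for _ in range(size)]
    for step in range(-4 * c, 4 * c + 1):       # quarter-pixel steps along the line
        t = step / 4.0
        x = int(round(c + t * math.cos(theta)))
        y = int(round(c + t * math.sin(theta)))
        k[y][x] = 1.0
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

example = linear_kernel(15, angles[3])
```

Normalizing each kernel so that its entries sum to one keeps the overall image brightness unchanged after blurring.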

To obtain a proper result, we may perform $N$ iterations of our network, which leads to a large number of parameters to train. Here we follow the stage-based training strategy from [58]. We start with the first iteration, applying the $\mathcal{L}_{2}$ loss between the ground truth and the estimated results. We then fix the parameters of the previous iteration and update only those of the next iteration, continuing until the last iteration. This training process is expected to be more efficient than the end-to-end strategy because it limits the number of parameters updated in each training stage. In practice, we adopt a fixed learning rate (0.01) and a decay rate (0.95).
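The stage-based strategy can be illustrated with a deliberately tiny toy model. Nothing below reproduces the paper's network: each "stage" is a single scalar weight in the hypothetical update `e + w * tanh(e)`, the data are synthetic scalars, and the gradient is taken by finite differences. The sketch only shows the control flow, in which earlier stages are frozen while the newest stage is trained with an L2 loss. The learning rate of 0.01 follows the text; the decay rate is not modelled.

```python
import math

def run_stages(x, weights):
    """Apply the stages in sequence: e <- e + w * tanh(e)."""
    for w in weights:
        x = x + w * math.tanh(x)
    return x

def l2_loss(samples, weights):
    """Mean squared error between stage output and target."""
    return sum((run_stages(x, weights) - t) ** 2 for x, t in samples) / len(samples)

def train_stagewise(samples, num_stages, lr=0.01, steps=500, eps=1e-6):
    """Train one stage at a time; weights of earlier stages stay frozen."""
    weights, snapshots, losses = [], [], []
    for _ in range(num_stages):
        weights.append(0.0)                     # a new stage starts as the identity
        for _ in range(steps):
            # Finite-difference gradient with respect to the newest weight only.
            base = l2_loss(samples, weights)
            weights[-1] += eps
            grad = (l2_loss(samples, weights) - base) / eps
            weights[-1] -= eps + lr * grad
        snapshots.append(list(weights))
        losses.append(l2_loss(samples, weights))
    return weights, snapshots, losses

# Toy input/target pairs standing in for blurred/sharp image pairs.
samples = [(x / 10.0, 1.5 * x / 10.0) for x in range(-10, 11)]
weights, snapshots, losses = train_stagewise(samples, num_stages=3)
```

Because a new stage starts as the identity, adding it never makes the loss worse at the start of its training, and only one weight is updated at any time.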

Fig. 5 illustrates the training performance of four variations of our approach, i.e. LMoF, LMoF-Deep, LMoF-NoDF and LMoF-Deep-NoDF. Here LMoF denotes our method using a network of 8 DFs, 8 CFs, 8 hidden layers and linear combination, while LMoF-Deep is an enhanced version using a deeper network of 24 DFs, 48 CFs, 30 hidden layers and linear combination. LMoF-NoDF and LMoF-Deep-NoDF are the corresponding versions without the DF layer. LMoF and LMoF-Deep outperform their counterparts without the DF layer. It is worth noting that with a larger number of DFs, CFs, hidden layers and network iterations, the deblurring quality can be greatly improved.

The experiments in Table 1 (LMoF versus LMoF-NoDF) show that the final optical flow precision is improved by around 20% for all trials. The deeper version (LMoF-Deep) shows a similar trend: LMoF-Deep gives lower training error than LMoF-Deep-NoDF, and LMoF-Deep generally gives the best precision for most of the trials in Table 1. We believe that these improvements in training and results are not due to overfitting. However, the number of network iterations significantly affects the computational speed. In this context, we run three iterations for each of our implementations to balance general performance with precision.

In the next section, we embed our proposed network into an optical flow framework.

5 An Iterative Optical Flow Framework

Algorithm 1 sketches the proposed CNN-based Optical Flow Framework, which interleaves our layered network and an iterative optical flow optimization.

Within our framework, the input images $I_{\{1,2\}}$ are first resized into a coarse-to-fine (top-down) pyramid. On each pyramid level $i$: (1) the resized images $I^{i}_{\{1,2\}}$ are fed into our Layered Network for Blind Deconvolution (Sec. 4), which yields intermediate latent images $\ell^{i}_{*}$ and kernels $k^{i}_{*}$; (2) this information is then used to generate the uniform Motion Energy for Blurred Images (Sec. 3); (3) this blurred energy is optimized for the incremental optical flow field $d\textbf{w}^{i}$ (Sec. 5.1). Finally, the parameters $\ell^{i}_{*}$ and $\textbf{w}^{i}$ are propagated to the next level until convergence. Note that our framework is not a simple combination of image deblurring and optical flow estimation. Our Layered Network for Blind Deconvolution is deeply embedded (Per-level) into every level of the image pyramid, and the following blur matching step (step 13, Algorithm 1) further preserves brightness constancy. In this case, the CNN-based deblurring process is automatically optimized for the image size at each level of the pyramid. Table 2 quantifies the advantage of our Per-level strategy on the ground truth dataset.
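The per-level control flow described above can be outlined as a skeleton. This is a structural sketch only, not Algorithm 1 itself: the callables `resize`, `deblur`, `upsample` and `solve_increment` are hypothetical placeholders passed in by the caller, the minimum level size of 16 pixels is an assumption, and the downsampling factor of 0.8 is the value reported in Sec. 5.2.

```python
def pyramid_sizes(width, height, factor=0.8, min_size=16):
    """Image sizes from the finest level down to the coarsest level."""
    sizes = [(width, height)]
    while min(sizes[-1]) * factor >= min_size:
        w, h = sizes[-1]
        sizes.append((int(round(w * factor)), int(round(h * factor))))
    return sizes

def coarse_to_fine(frames, sizes, resize, deblur, upsample, solve_increment):
    """Per-level loop: deblur both frames, solve for dw, propagate the flow."""
    flow = None
    for size in reversed(sizes):                        # coarsest level first
        I1, I2 = (resize(f, size) for f in frames)
        (_, k1), (_, k2) = deblur(I1), deblur(I2)       # layered network (Sec. 4)
        flow = upsample(flow, size)                     # zero flow on the coarsest level
        # Blur matching and energy minimization (Secs. 3 and 5.1) live in here.
        d_flow = solve_increment(I1, I2, k1, k2, flow)
        flow = [a + b for a, b in zip(flow, d_flow)]    # w <- w + dw
    return flow

# Trivial stand-ins so the skeleton runs; they do no real image processing.
sizes = pyramid_sizes(480, 480)
flow = coarse_to_fine(
    frames=("frame1", "frame2"),
    sizes=sizes,
    resize=lambda f, size: (f, size),
    deblur=lambda img: (img, "kernel"),
    upsample=lambda prev, size: prev or [0.0, 0.0],
    solve_increment=lambda I1, I2, k1, k2, prev: [1.0, -1.0],
)
```

The point of the structure is that `deblur` is called inside the loop, once per level, rather than once before flow estimation starts. A real `upsample` would also rescale the flow vectors to the finer resolution.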

In the next subsection, we introduce our energy optimization scheme in detail.

5.1 Energy Optimization

To solve our highly nonlinear optical flow energy (Eq. 1), we follow a Nested Fixed Point based optimization scheme [39], which has recently been used in state-of-the-art approaches. We define:

[TABLE]

We first apply the Euler-Lagrange equations to the energy in Eq. 1. The resulting system is then minimized in a coarse-to-fine fashion (Algorithm 1). We initialize the flow field $\textbf{w}=(0,0)^{T}$ on the coarsest level, and iteratively update the flow field on the next finer level as $\textbf{w}^{i+1}\approx\textbf{w}^{i}+d\textbf{w}^{i}$. Here $d\textbf{w}$ denotes the flow increments, in which the remaining system is still nonlinear. The increments are the solutions of

[TABLE]

where the terms $(\phi^{\prime})_{B}^{i}$ and $(\phi^{\prime})_{S}^{i}$, which contain $\phi$, provide robustness to flow discontinuities at object boundaries. In addition, $(\phi^{\prime})_{S}^{i}$ also acts as a regularizer for a gradient constraint in motion space. These terms are detailed as follows:

[TABLE]

For the further linearization of the system in Eqs. (15, 16), please refer to our supplementary document.

5.2 Implementation

In our implementation, we use a customized C++/CUDA version of Caffe [62] for both network training and testing. During training, we sample 8 directions ( $k=8$ ) for the directional filtering. Training takes around a week for each iteration on a platform with an Intel i7 3.5 GHz CPU and a GTX 780 4 GB GPU. Furthermore, we implement the optical flow framework in C++ and construct the image pyramid with a downsampling factor of 0.8. The final system is solved using Conjugate Gradients with 60 iterations.
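For readers unfamiliar with the solver, the following is a minimal sketch of the Conjugate Gradients method with the iteration cap of 60 mentioned above. It assumes a generic symmetric positive definite system $A\textbf{x}=\textbf{b}$ stored as a dense NumPy array; the actual linearized flow system (see the supplementary document) is sparse and is not reproduced here, and the tolerance `tol` is an illustrative assumption.

```python
import numpy as np

def conjugate_gradient(A, b, max_iter=60, tol=1e-8):
    """Solve A x = b for a symmetric positive definite A (illustrative sketch)."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:  # assumed early-exit tolerance
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Usage on a small SPD system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

A fixed iteration cap bounds the cost per pyramid level, at the price of an inexact solve when the system is poorly conditioned.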

6 Evaluation

In this section, we evaluate three variations of our proposed approach, i.e. LMoF, LMoF-Deep and LMoF-NoDF (Sec. 4.4), against four other well-known optical flow methods, i.e. Portz et al. [52], Li et al. [57], MDP [41] and Brox et al. [39]. Portz et al.’s approach introduces a uniform blur parameterization and provides sharp image alignment for both camera-shake and object blur. Li et al.’s method uses directional filtering and gives recent state-of-the-art precision for camera-shake blur. MDP is currently one of the top methods on the Middlebury benchmark [36], while Brox et al.’s method uses an optimization scheme similar to that of the proposed method. We use the default parameter settings for all baselines.

In the following subsections, we evaluate our method on a synthetic ground truth (GT) dataset and on real-world sequences, respectively.

6.1 Customized Benchmark

It is difficult to quantitatively evaluate optical flow on real-world blurry scenes, which may lead to ambiguous matching. Portz et al. propose a synthetic benchmark that contains blurry object motion on a blur-free background but lacks camera-shake blur. Furthermore, by carefully sampling useful correspondences, Li et al. [8] synthesize a benchmark for camera-shake blurred scenes by convolving selected blur kernels with a customized subset of the well-known Middlebury dataset. This benchmark is challenging because it contains many small details that are easily destroyed by blur.
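To make the synthesis step concrete, the sketch below applies the blur model $I = k * \ell + n$ to a sharp frame. It is an illustration under stated assumptions, not the benchmark generation code of Li et al. [8]: the linear motion kernel, its default size of 15 and the Gaussian noise model are all hypothetical choices, and the kernel size is assumed to be odd.

```python
import numpy as np

def linear_motion_kernel(length, angle_deg, size=15):
    """Normalized line-shaped blur kernel (assumed kernel family, odd size)."""
    k = np.zeros((size, size))
    c = size // 2
    theta = np.deg2rad(angle_deg)
    for t in np.linspace(-(length - 1) / 2, (length - 1) / 2, 4 * length):
        x = int(round(c + t * np.cos(theta)))
        y = int(round(c - t * np.sin(theta)))
        if 0 <= x < size and 0 <= y < size:
            k[y, x] = 1.0
    return k / k.sum()

def blur(latent, kernel, noise_sigma=0.0, rng=None):
    """Compute I = k * l + n with edge padding and optional Gaussian noise."""
    kh, kw = kernel.shape
    padded = np.pad(latent, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
    out = np.zeros(latent.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += flipped[dy, dx] * padded[dy:dy + latent.shape[0],
                                            dx:dx + latent.shape[1]]
    if noise_sigma > 0:
        rng = np.random.default_rng() if rng is None else rng
        out += rng.normal(0.0, noise_sigma, latent.shape)
    return out

# Usage: blur a random frame with a 9-pixel horizontal kernel.
frame = np.random.default_rng(0).random((32, 48))
blurred = blur(frame, linear_motion_kernel(9, 0.0), noise_sigma=0.01)
```

Since the same kernel acts on every pixel, fine structures narrower than the kernel length are averaged away, which is why small details make the benchmark hard.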

In this evaluation, we add further challenges. As shown in Fig. 6, we synthesize two additional GT sequences by applying Li et al.’s GT methodology to selected Sintel [37] sequences (Market2 and Bamboo1, downsampled to $615\times 262$ pix.). This extra benchmark is intended to introduce further difficulties, e.g. mixed blur, large displacements and illumination changes.

Table 1 gives the quantitative comparison of our methods (three implementations) against the other baselines. Our LMoF-Deep yields the best Average Endpoint Error (AEE) on all sequences. It ranks second in Average Angular Error (AAE) on Market2 and gives the best AAE in all other trials. Li et al.’s method is the state-of-the-art approach in the community and provides very competitive precision compared to LMoF, a fast version of our method. Their approach gives the second best AEE on Grove2, Hydrangea, Rub.Whale and Bamboo1, as well as the third best AEE on Urban2. Our other implementations, LMoF and LMoF-NoDF, also outperform the baselines of Portz et al., Brox et al. and MDP in most of the trials. All our implementations show reasonable speed in the experiments. Note that most baselines give relatively large errors ( $>$ 3 pixel AEE and $>$ 6 degrees AAE) on Market2 because the sequence contains additional difficulties, e.g. invariant blur (motion blur and camera blur), large displacements and noise.
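The two error measures reported in Table 1 are not defined in this section. The sketch below uses the definitions that are conventional in the optical flow benchmarking literature, which we assume match the ones used here: AEE is the mean Euclidean distance between estimated and ground truth flow vectors, and AAE is the mean angle between the space-time vectors $(u, v, 1)$ and $(u_{gt}, v_{gt}, 1)$. The function names are illustrative.

```python
import numpy as np

def average_endpoint_error(flow, flow_gt):
    """Mean Euclidean distance between flow vectors, in pixels."""
    return float(np.mean(np.linalg.norm(flow - flow_gt, axis=-1)))

def average_angular_error(flow, flow_gt):
    """Mean angle between (u, v, 1) and (u_gt, v_gt, 1), in degrees."""
    u, v = flow[..., 0], flow[..., 1]
    ug, vg = flow_gt[..., 0], flow_gt[..., 1]
    num = u * ug + v * vg + 1.0
    den = np.sqrt(u ** 2 + v ** 2 + 1.0) * np.sqrt(ug ** 2 + vg ** 2 + 1.0)
    cos = np.clip(num / den, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return float(np.mean(np.degrees(np.arccos(cos))))

# Usage: flows are (H, W, 2) arrays holding (u, v) per pixel.
gt = np.zeros((4, 5, 2))
est = np.zeros((4, 5, 2))
est[..., 0] = 1.0  # every pixel is off by one pixel horizontally
aee = average_endpoint_error(est, gt)
aae = average_angular_error(est, gt)
```

The appended 1 in the angular measure keeps the error defined for zero flow, and it also explains why AEE and AAE can rank methods differently on the same sequence.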

Table 1 also demonstrates the advantage of our method over two neural network based optical flow approaches, i.e. FlowNet [50] (FlowNetS and FlowNetC) and Teney & Hebert [51], which provide an end-to-end process to recover optical flow from input images. We observe that both implementations of FlowNet yield large errors on the small motion scenes (Grove2, Hydrangea, Rub.Whale and Urban2), while they give relatively higher accuracy on the large motion cases, i.e. Bamboo1 (2.00 pixel AEE) and Market2 (4.01 pixel AEE). Teney & Hebert encode a hidden coarse-to-fine optimizer within the network. With this advantage, they give improved results on the small motion scenes and outperform the traditional approaches of Brox et al. and MDP in most of the trials. However, our methods produce the top precision on all sequences except for the AAE on Market2 (FlowNetS, 8.18 degrees AAE).

Fig. 7 visualizes the AEE of all the baselines on Bamboo1 and Market2. Overall, our methods lose less detail and give clearer object boundaries. Brox et al.’s method oversmooths the object details of the scene. MDP leads to extra errors because its feature detection and matching process is compromised by the blur, which even introduces errors into the final energy. We observe that all the baselines give large errors in the left area of Market2 because the object there moves quickly and causes extra motion blur. Such invariant blur cannot be handled by any of the baselines and is outside the scope of this paper.

Moreover, Table 2 shows a quantitative analysis of different deblurring strategies within our proposed approach. Here Per-level denotes the deblurring strategy used in our implementations. On each level of the image pyramid (coarse-to-fine), our deblurring network stacks the blurred image and the latent image propagated from the previous level in order to compute the optimized latent image. This latent image is then propagated to the next level. Hence, by the final level, our deblurring network has run $N\times M$ network iterations on each of the input images, where $N$ is the number of network iterations (Fig. 1) and $M$ is the number of levels of the image pyramid. In contrast, Independent denotes the strategy in which our deblurring network and the optical flow optimization are treated as two independent steps. In this case, the deblurring network runs only once, on the full resolution images. We observe that our Per-level strategy significantly improves the precision on both small motion (Urban2) and large motion (Market2) scenes but takes longer to compute (29s vs. 9s). The quantitative analysis also shows that Blur Matching (see Sec. 3) further improves the final results.

In Table 3, we further evaluate how our deblurring network contributes to the final optical flow estimation. To highlight our advantage, we build eight variations by replacing our deblurring network with four selected deblurring approaches, each in either the Independent or the Per-level fashion. Hradiš et al. [27] and Chakrabarti [60] are neural network based approaches. The former gives high quality deblurring results on fine details, e.g. text and license plates, while the latter achieves state-of-the-art results on general real-world scenes. Levin et al. [61] and Xu & Jia [20] are non-learning methods. The former is one of the most popular approaches in practice, while the latter shows high performance on noisy images. Please note that we use the default, fixed parameter settings in all the trials.

We observe that our method (both Independent and Per-level) yields the best precision in both trials while also being much faster than the other baselines. We also observe that Hradiš et al. [27] and Chakrabarti [60] give lower errors when applied with the Per-level deblurring strategy, whereas Levin et al. [61] and Xu & Jia [20] give relatively higher accuracy when performed as an Independent process. Our optimization framework (FE) contains a coarse-to-fine image pyramid processed in a top-down fashion. In the Per-level deblurring strategy, the baseline has to be run on different resolutions of the image (different levels of the image pyramid). It is difficult for the non-learning methods to adapt to different resolutions without manually tuning the parameters, which may introduce extra errors. The neural network based approaches, however, are expected to mitigate this issue if the training data is sufficient to cover different sizes of blur kernels.

Using the sequence Hydrangea, Fig. 8 quantifies and visualizes the effect of increasing the noise level. As the noise level increases, all the methods give larger errors overall. The AEE of our implementations stays at a relatively reasonable level ( $<$ 3.2 pix. AEE), while the errors of Brox et al., Portz et al. and MDP climb quickly. At the largest noise level ( $45\%$ ), our LMoF-Deep gives the best precision (1.61 pix. AEE), Li et al.’s method yields a very competitive measure (1.88 pix. AEE) and LMoF gives 2.13 pix. AEE. The robustness of these three approaches against noise benefits from the directional filtering, which efficiently removes the noise but preserves the useful information [8].

In this evaluation, we also compare our proposed approach to the recently popular method of Li et al. [57], which uses ground truth camera motion to regularize the optical flow estimation. Their method gives good precision on real-world blurry footage, but it strictly requires additional hardware and difficult calibration, and its parameters have to be tuned carefully for different scenes. Our method models the optical flow from blurry footage using a convolutional neural network. It is an end-to-end unsupervised approach that does not need any manual parameter tuning or additional information or hardware. It provides rapid computation and adapts to various image resolutions and kernel sizes. In our quantitative analysis (Table 1), our method gives more than $30\%$ AEE improvement and is $10\%-25\%$ faster than Li et al. [57].

6.2 Real-world Scenes with Camera-shake Blur

To illustrate the feasibility of our method, we qualitatively compare our approach to the other baselines on real-world sequences. Fig. 9 shows, from left to right, the sequences Chessboard, Desktop and Books. Chessboard contains real-world photometric effects with nonrigid deformations and small occlusions, while Desktop exhibits large camera motion and some featureless regions. Books contains large displacements and out-of-plane rotation. We observe that our methods give sharper flow at object boundaries and better shape preservation in the image warping.

7 Conclusion

In this paper, we investigate the problem of recovering optical flow from camera-shake video footage. We first propose a novel CNN architecture for video frame deblurring that uses an extra directional similarity and filtering layer. In practice, such learnable filters are able to adaptively preserve the directional blur information without prior knowledge of the camera motion. We then highlight the benefits of the Per-level integration of our network into an iterative optical flow framework. The evaluation demonstrates that our hybrid framework gives competitive overall precision and higher runtime performance.

The limitations of our method may lie in the presence of mixed blur, globally invariant blur and spatial noise. Such difficulties could be alleviated by using more comprehensive training data.

Acknowledgements

This work was partially conducted when Wenbin Li was affiliated to UCL Department of Computer Science and University of Bath. We thank Gabriel Brostow and the UCL PRISM Group for their helpful comments. The authors are partially supported by Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA) EP/M023281/1; and EPSRC projects EP/K023578/1 and EP/K02339X/1.

Bibliography

[1] W. Li, D. Cosker, M. Brown, An anchor patch based optimisation framework for reducing optical flow drift in long image sequences, in: Asian Conference on Computer Vision (ACCV’12), Springer, 2012, pp. 112–125.
[2] W. Li, D. Cosker, M. Brown, Drift robust non-rigid optical flow enhancement for long sequences, Journal of Intelligent and Fuzzy Systems 31 (5) (2016) 2583–2595.
[3] R. Tang, D. Cosker, W. Li, Global alignment for dynamic 3d morphable model construction, in: Workshop on Vision and Language (V&LW’12), pp. 1–2.
[4] C. Godard, P. Hedman, W. Li, G. J. Brostow, Multi-view reconstruction of highly specular surfaces in uncontrolled environments, in: 3D Vision (3DV), 2015 International Conference on, IEEE, 2015, pp. 19–27.
[5] W. Li, F. Viola, J. Starck, G. J. Brostow, N. D. Campbell, Roto++: Accelerating professional rotoscoping using shape manifolds, ACM Transactions on Graphics (In proceeding of ACM SIGGRAPH’16) 35 (4).
[6] G. Ren, W. Li, E. O’Neill, Towards the design of effective freehand gestural interaction for interactive tv, Journal of Intelligent and Fuzzy Systems 31 (5) (2016) 2659–2674.
[7] B. Horn, B. Schunck, Determining optical flow, Artificial intelligence 17 (1-3) (1981) 185–203.
[8] W. Li, Y. Chen, J. Lee, G. Ren, D. Cosker, Robust optical flow estimation for continuous blurred scenes using rgb-motion imaging and directional filtering, in: IEEE Winter Conference on Application of Computer Vision (WACV’14), IEEE, 2014, pp. 792–799.