TL;DR
This paper introduces a novel high framerate video frame synthesis framework that fuses low-speed frame data with high-speed event data, utilizing a differentiable model and deep learning to improve video quality in challenging scenes.
Contribution
The paper presents a new hybrid sensor fusion framework with a differentiable model and a deep learning denoiser for high framerate video synthesis, outperforming existing methods.
Findings
Better performance than state-of-the-art methods
Effective handling of fast motion and occlusions
Enhanced video quality with contrast and motion awareness
| Clip name | Plug & play | One-time |
|---|---|---|
| Motorcycle | 28.07 / .951 | |
| Car race | 24.53 / .883 | |
| Football Player | 29.94 / .935 | |
Event-driven Video Frame Synthesis
Zihao W. Wang1 Weixin Jiang1 Kuan He1 Boxin Shi2 Aggelos Katsaggelos1 Oliver Cossairt
1 Northwestern University 2 Peking University
{winswang, weixinjiang2022}@u.northwestern.edu
Abstract
Temporal Video Frame Synthesis (TVFS) aims at synthesizing novel frames at timestamps different from existing frames, which has wide applications in video codecs, editing and analysis. In this paper, we propose a high framerate TVFS framework which takes hybrid input data from a low-speed frame-based sensor and a high-speed event-based sensor. Compared to frame-based sensors, event-based sensors report brightness changes at very high speed, which may well provide useful spatio-temporal information for high framerate TVFS. In our framework, we first introduce a differentiable forward model to approximate the physical sensing process, fusing the two different modes of data as well as unifying a variety of TVFS tasks, i.e., interpolation, prediction and motion deblur. We leverage autodifferentiation, which propagates the gradients of a loss defined on the measured data back to the latent high framerate video. We show results with better performance compared to the state of the art. Second, we develop a deep learning-based strategy to enhance the results from the first step, which we refer to as a residual “denoising” process. Our trained “denoiser” goes beyond Gaussian denoising and shows properties such as contrast enhancement and motion awareness. We show that our framework is capable of handling challenging scenes including both fast motion and strong occlusions. Supplementary material, demo and code are released at: https://github.com/winswang/int-event-fusion/tree/win10.
1 Introduction
Conventional video cameras capture intensity signals at a fixed speed and output signals frame by frame. However, this capture convention is motion agnostic. When the motion in the scene is significantly faster than the capturing speed, the motion is usually under-sampled, resulting in motion blur or large discrepancies between consecutive frames, depending on the shutter speed (exposure time). One direct solution for capturing fast motion is to use high speed cameras, at the cost of increased hardware complexity, degraded spatial resolution and/or reduced signal-to-noise ratio. Moreover, high speed moments usually happen instantaneously between periods of regular motion. As a consequence, either we end up collecting long sequences of frames with a great amount of redundancy, or the high-speed moment is missed before we think to turn on the “slow-motion” mode.
We argue that high speed motion can be acquired and synthesized effectively by augmenting a regular-speed camera with a bio-inspired event camera [8, 24]. Compared to conventional frame-based sensors, event pixels independently detect logarithmic brightness variation over time and output “events” with four attributes: 2D spatial location, polarity (e.g., “1”: brightness increases; “0”: brightness decreases) and timestamp ( latency). This new sensing modality has salient advantages over frame-based cameras: 1) the asynchronism of event pixels results in sub-millisecond temporal resolution, much higher than regular-speed cameras ( FPS); 2) since each pixel responds only to intensity changes, the temporal redundancy and power consumption can be significantly reduced; 3) sensing intensity changes in logarithmic scale enlarges the dynamic range to over 120 dB (the typical dynamic range of a conventional camera is 90 dB). However, event-based cameras have a higher noise level than low framerate cameras. Moreover, the bipolar form of the output does not represent the exact temporal gradients, which makes high framerate video reconstruction from event-based cameras alone challenging.
In this paper, we propose a high framerate video synthesis framework using a combination of regular-speed intensity frame(s) and neighboring event streams, as shown in Fig. 1. Compared to intensity-only or event-only TVFS algorithms, our work takes advantage of both modalities, i.e., high-speed information from events and high contrast spatial features from intensity frame(s).
Our contributions are listed below:
1. We introduce a differentiable fusion model that can handle various temporal settings. We consider three fundamental cases, i.e., interpolation, prediction and motion deblur, which can serve as building blocks for other complex settings. The problem can be solved by automatic differentiation. We refer to this process as Differentiable Model-based Reconstruction (DMR).
2. We introduce a novel event binning strategy and compare it against the conventional stacking-based binning strategy [2, 3, 34, 40]. Our binning preserves the temporal information of events necessary for high framerate video reconstruction. Additionally, we perform a statistical evaluation of our binning strategy on the existing dataset [29].
3. We introduce a deep learning strategy for further improving the DMR results. We model the DMR artifacts as additive “noise” and perform “denoising” via deep residual learning. During training, we augment the samples by randomizing all the parameters of the DMR. We show preliminary results that the trained residual denoiser (RD) has properties including contrast enhancement and motion awareness, which go beyond those of a Gaussian denoiser.
2 Related work
Multimodal sensor fusion. Fusion among different types of sensing modalities for improved quality and functionality is an interesting topic. A problem related to ours is to spatially upsample functional sensors, e.g., depth or hyperspectral sensors, with a high resolution guide image. The fusion problem can be formulated as joint image filtering via bilateral [20] or multi-lateral filters [9], or via Convolutional Neural Network (CNN) based approaches [23]. For high-speed video sensing, a fusion strategy can be employed between high-speed video cameras (low spatial resolution) and high spatial resolution still cameras (low speed) [5, 12, 13, 37, 44].
Our paper investigates the temporal upsampling problem. While previous approaches investigate it within the framework of compressive sensing [1, 14, 17, 26, 35, 38, 41], we formulate our work as fusing event streams with intensity images to obtain a temporally dense video. Compared to existing literature [36], which integrates event counts per pixel across time, our differentiable model utilizes “tanh” functions as event activation units and imposes sparsity constraints on both the spatial and temporal domains.
Event-based image and video reconstruction. Converting event streams (binary) to multiple-valued intensity frames is a challenging task, yet has been shown beneficial to downstream visual tasks [34]. Existing strategies for image reconstruction include dictionary learning [3], manifold regularization [30], optical flow [2], exponential integration [32, 36], conditional Generative Adversarial Networks (GAN) [40] and recurrent neural network [34]. Compared to existing algorithms, our work is the first, to the best of our knowledge, to unify different temporal frame synthesis settings, including interpolation, extrapolation (prediction) and motion deblur (reconstructing a video from a motion-blurred image).
Non-event-based video frame synthesis. 1) Interpolation: Early work on video frame interpolation has focused on establishing block-wise [10] and/or pixel-wise [21, 27] correspondences between available frames. Improved performance has been achieved via coarse-to-fine estimation [4], texture decomposition [42], and deep neural networks (DNN) [16]. Recent DNN-based approaches include deep voxel flow [25], separable convolution [31], flow computation and interpolation CNN [18]. 2) Prediction: Recent work on future frame prediction has proposed to use adversarial nets [28], temporal consistency losses [6] and layered cross convolution networks [43]. 3) Motion deblur: Recent work on resolving a sharp video/image from blurry image(s) has leveraged adversarial loss [22], gated fusion network [47], ordering-invariant loss [19], etc.
3 Approach
3.1 Image formation
Assume there exists a high framerate video denoted by tensor $\mathscr{H}$ ($\mathscr{H}$ is indexed on the time axis starting from 1; the color channel is omitted here). The forward sensing process results in two observational tensors, i.e., the intensity frame tensor and the event frame tensor. Our goal is to recover tensor $\mathscr{H}$ based on the observation of intensity and event data.
Intensity frame tensor. We consider three sensing cases, i.e., 1) interpolation from the first and last frames of $\mathscr{H}$; 2) prediction based on the first frame of $\mathscr{H}$; and 3) motion deblur, in which case the intensity tensor is the summation over time. This can be visualized in Fig. 2.
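The three sensing cases can be read as simple operations on the latent video. The following is a minimal NumPy sketch under stated assumptions, not the released implementation: the function name `intensity_forward`, the `(T, H, W)` array layout, and the string case labels are hypothetical, and the deblur case follows the text in using a plain sum over time with no normalization.

```python
import numpy as np

def intensity_forward(video, case):
    """Map a latent high framerate video of shape (T, H, W) to the observed intensity tensor.

    The case names and the array layout are illustrative assumptions.
    """
    if case == "interpolation":
        # The first and last frames are observed.
        return np.stack([video[0], video[-1]])
    if case == "prediction":
        # Only the first frame is observed.
        return video[:1]
    if case == "deblur":
        # One blurred frame: summation over the time axis.
        return video.sum(axis=0, keepdims=True)
    raise ValueError(f"unknown case: {case}")

# Toy latent video with 4 frames of 2x2 pixels.
video = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
interp_obs = intensity_forward(video, "interpolation")
pred_obs = intensity_forward(video, "prediction")
blur_obs = intensity_forward(video, "deblur")
```

Writing all three cases as functions of the same latent tensor is what lets a single reconstruction procedure cover interpolation, prediction and motion deblur.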
Event frame tensor. As previously introduced, a pixel fires a binary output/event if the log-intensity changes beyond a threshold (positive or negative). This thresholding model can be viewed in Fig. 3(a). Mathematically, the event firing process can be expressed as,
[TABLE]
If the log-intensity change stays within both thresholds, no events are generated. In order to approximate this event firing process, we model each event frame as a function of the adjacent frames from the high framerate tensor $\mathscr{H}$, i.e.,
[TABLE]
where a tuning parameter adjusts the slope of the activation curve. This function can be viewed in Fig. 3(b). Based on this formulation, a video tensor with $N$ temporal frames corresponds to $N-1$ event frames.
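The soft event model can be illustrated with a short NumPy sketch. It assumes the event frame is a "tanh" of the scaled difference between adjacent frames; whether that difference is taken on raw or log intensities is not recoverable from the text here, so the sketch simply uses the values stored in `video`. The name `event_forward` and the slope value `alpha=10.0` are hypothetical.

```python
import numpy as np

def event_forward(video, alpha=10.0):
    """Approximate event frames from adjacent frames of a (T, H, W) video.

    alpha is the slope of the tanh activation (the value here is illustrative).
    """
    diff = video[1:] - video[:-1]   # T-1 adjacent-frame differences
    return np.tanh(alpha * diff)    # soft event values in (-1, 1)

# One pixel brightens between frames 0 and 1, then darkens between frames 1 and 2.
video = np.zeros((3, 2, 2))
video[1, 0, 0] = 1.0
events = event_forward(video)
```

Three input frames give two event frames, and a larger `alpha` pushes the output closer to the hard {-1, 0, 1} thresholding behavior.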
3.2 Differentiable model-based reconstruction
The DMR is performed by minimizing a weighted combination of several loss functions. The objective function is formed as,
[TABLE]
Pixel loss. The pixel loss includes per-pixel difference loss against intensity and event pixels in norm, i.e.,
[TABLE]
over the entire available data range. The two terms compare the captured intensity and event data against the outputs of the forward sensing models described in Fig. 2 and Equation (2), respectively. The expectation is taken with respect to the observed pixels/events.
Sparsity loss. We employ total variation (TV) sparsity in the spatial and temporal dimensions of the high-res tensor $\mathscr{H}$. The TV sparsity loss is defined as:
[TABLE]
where $\dot{\mathscr{H}}_{xy}$ and $\dot{\mathscr{H}}_{t}$ denote the spatial and temporal derivatives of $\mathscr{H}$, respectively. We later denote $\mathcal{L}_{TV_{xy}}=\mathbb{E}_{hpix}\big[\lVert\dot{\mathscr{H}}_{xy}\rVert_{1}\big]$ and $\mathcal{L}_{TV_{t}}=\mathbb{E}_{hpix}\big[\lVert\dot{\mathscr{H}}_{t}\rVert_{1}\big]$. $\mathcal{L}_{TV_{xy}}$ can be viewed as a denoising term for the intensity tensor, and $\mathcal{L}_{TV_{t}}$ can be viewed as an event denoising term. A comparison of the performance for each loss function is shown in Fig. 4. The figure shows a synthetic case for single-frame interpolation. We use three frames, resulting in two event frames (Equation (1)). Combining the spatial and temporal TV losses results in better performance.
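As a rough illustration of the two sparsity terms, the sketch below computes spatial and temporal TV as mean absolute finite differences. This anisotropic form is an assumption: the exact derivative definitions are not recoverable from the text, and the function name `tv_losses` is hypothetical.

```python
import numpy as np

def tv_losses(video):
    """Return (spatial TV, temporal TV) of a (T, H, W) video as mean absolute differences.

    Anisotropic finite differences are an illustrative choice, not necessarily the paper's.
    """
    dx = np.abs(video[:, :, 1:] - video[:, :, :-1])  # horizontal differences
    dy = np.abs(video[:, 1:, :] - video[:, :-1, :])  # vertical differences
    dt = np.abs(video[1:] - video[:-1])              # temporal differences
    return dx.mean() + dy.mean(), dt.mean()

# A static vertical edge: spatial variation, no temporal variation.
video = np.zeros((3, 4, 4))
video[:, :, 2:] = 1.0
tv_xy, tv_t = tv_losses(video)
```

For this static edge the temporal term is zero while the spatial term is positive, which matches the reading of the spatial term as an intensity prior and the temporal term as an event prior.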
Implementation. We use stochastic gradient descent to optimize Equation (3) so as to reconstruct the latent high-res tensor. Our algorithm is implemented in TensorFlow. We use the Adam optimizer. The learning rate varies depending on the tensor size as well as related parameters. Empirically, we recommend 0.002 as the initial value. We recommend scheduling the learning rate to decrease every 200 epochs. The momenta . For the case of interpolation, we initialize the high-res tensor by linearly blending the two available low-res frames. For prediction and motion deblur, we initialize the high-res tensor using the available single low-res frame. An example of the optimization progress can be viewed in Fig. 5. As the loss decreases, both PSNR and SSIM increase and gradually converge.
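To make the optimization loop concrete, here is a simplified sketch of the prediction case. It departs from the described setup in several labeled ways: it uses plain gradient descent with hand-derived gradients instead of TensorFlow autodifferentiation with Adam, squared-error data terms instead of the paper's norm, and no TV terms. The function name `dmr_predict` and the values of `alpha`, `lr` and `steps` are hypothetical.

```python
import numpy as np

def dmr_predict(first_frame, events, alpha=2.0, lr=0.2, steps=500):
    """Recover a (T, H, W) video from its first frame and T-1 soft event frames.

    Simplified stand-in for DMR: squared-error losses, plain gradient descent.
    """
    n_frames = events.shape[0] + 1
    # Initialize every frame with the single available frame, as in the text.
    video = np.repeat(first_frame[None], n_frames, axis=0).astype(float)
    losses = []
    for _ in range(steps):
        pred = np.tanh(alpha * (video[1:] - video[:-1]))  # event forward model
        res_e = pred - events                             # event residual
        res_i = video[0] - first_frame                    # intensity residual
        losses.append((res_e ** 2).mean() + (res_i ** 2).mean())
        # Analytic gradient of the event term with respect to the frame differences.
        g_diff = 2.0 * res_e * (1.0 - pred ** 2) * alpha / res_e.size
        grad = np.zeros_like(video)
        grad[1:] += g_diff
        grad[:-1] -= g_diff
        grad[0] += 2.0 * res_i / res_i.size
        video -= lr * grad
    return video, losses

# Synthetic ground truth with small frame-to-frame changes.
rng = np.random.default_rng(0)
truth = np.cumsum(rng.uniform(-0.2, 0.2, size=(4, 3, 3)), axis=0)
events = np.tanh(2.0 * (truth[1:] - truth[:-1]))
recon, losses = dmr_predict(truth[0], events)
```

With an autodifferentiation framework the hand-written gradient lines disappear, which is what makes it practical to add further terms such as the TV losses or to switch between the interpolation, prediction and deblur forward models.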
3.3 Binning events into event frames
Our event sensing model requires binning events into frames. The ideal binning strategy would be “one frame per event”. However, this binning strategy is unnecessarily expensive. For example, the events between two consecutive frames (22 FPS in [29]) may vary from thousands to tens of thousands, resulting in computational challenges and redundancy. However, events happening at different locations but at very close timestamps can be processed in the same event frame. Therefore, we design and compare two binning strategies:
Binning 1 (proposed): For an incoming event, if its spatial location already has an event in the current event frame, then cast it into a new event frame; otherwise, this incoming event will stay in the current event frame. In this case, each event frame should only have three values, i.e., {-1, 0, 1}.
Binning 2: Similar to several previous works [2, 3, 34, 40], where events are stacked/integrated over a time window, we allow each event frame to have more than three values. However, since the “tanh” function in Equation (2) only outputs values between -1 and 1, we modify our event sensing model to have a summation operation over several sub-event frames. Mathematically, .
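Binning 1 can be sketched in a few lines. The sketch assumes events arrive sorted by timestamp as `(x, y, polarity)` tuples with polarity encoded as 1 or 0, and it interprets "cast it into a new event frame" as opening a new frame that also receives the subsequent events. The function name `bin_events` and the tuple layout are hypothetical.

```python
import numpy as np

def bin_events(events, height, width):
    """Bin time-ordered (x, y, polarity) events into frames with values in {-1, 0, 1}."""
    frames = [np.zeros((height, width), dtype=np.int8)]
    for x, y, polarity in events:
        if frames[-1][y, x] != 0:
            # This location already holds an event: open a new event frame.
            frames.append(np.zeros((height, width), dtype=np.int8))
        frames[-1][y, x] = 1 if polarity == 1 else -1
    return np.stack(frames)

# Five events on a 2x2 sensor; pixel (0, 0) fires three times.
events = [(0, 0, 1), (1, 0, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
frames = bin_events(events, height=2, width=2)
```

The number of frames adapts to the data: a pixel that fires repeatedly forces new frames, while events at distinct locations with close timestamps share one frame.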
We show DMR results for a frame interpolation case using the DAVIS dataset [29] in Fig. 6. We use two consecutive intensity frames and the events in-between. In Row 1 (“slider_depth”), 9 event frames are binned from over 7,700 events using Binning 1. Row 2 (“simulation_3_planes”) has 19 event frames from over 40,000 events. For Binning 2, we match the sub-event frame number with Binning 1 so as to compare the performance. Frame #2 is shown. Our results show that Binning 1 preserves sharp spatial structures (a more detailed analysis and complete slow motion videos can be found in the supplementary material). For subsequent experiments, we use Binning 1.
3.4 Learning a residual denoiser
Although our proposed DMR can handle a variety of fusion settings, we observe that the DMR results may have visual artifacts. This is due to the ill-posedness of the fusion problem and the different noise levels of the two sensing modalities. In order to address these issues, we model the artifacts in the DMR output as additive “noise” and propose a “denoising” process to remove them. Inspired by ResNet [15] and DnCNN [45], we employ the residual learning scheme and train a residual denoiser (RD). Rather than training the denoiser on various levels of artificial noise, we train the network on the output of DMR. Mathematically, the residual is expressed as,
[TABLE]
where represents the reconstructed frame from DMR, and represents the ground truth frame. We use a residual block similar to [46], which has a {conv + ReLU} and a {conv} layer at the beginning and end, with 17 intermediate layers of {conv + BN + ReLU}. The kernel size is with stride of 1. The loss function for our denoiser is the mean squared error of and . During training, we augment data by randomizing the configuration parameters (including the running epochs) in DMR, summarized in Table 1. The goal of this augmentation is 1) to prevent overfitting; 2) to enforce learning of our DMR process; 3) to alleviate effects due to non-optimal parameter tuning. Our denoiser is single-frame, as we seek to enhance each DMR output frame individually without compromising the variety of DMR fusion settings.
4 Experiment results
We design several experiments to show the effectiveness of our algorithm. For DMR, we evaluate the three cases described in Fig. 2 on the DAVIS dataset [29], and compare against state-of-the-art event-based algorithms, i.e., Complementary Filter [36] and Event-based Double Integral [32]. For RD, we evaluate the effectiveness of our learning strategy by comparing with Gaussian denoisers, e.g., DnCNN [45] and FFDNet [46]. We finally compare our results with a non-event-based frame interpolation algorithm, SepConv [31].
4.1 Results for DMR
Interpolation. We first show interpolation results in Fig. 7. We use three consecutive frames from [29], withholding the middle frame. The intermediate events bin into 20 event frames. The ground truth middle frame is the closest to Frame #10.
Prediction. We next show frame prediction results, corresponding to Case 2 in Fig. 2. We withhold the end frame of two consecutive frames and seek to predict it using the start frame and “future” events. The results are shown in Fig. 8. Compared to CF [36], our results are less noisy and closer to the ground truth.
Motion deblur. Corresponding to Case 3 in Fig. 2, we compare our DMR results with the state-of-the-art Event-based Double Integral (EDI) [32], shown in Fig. 9. Compared to EDI, our results preserve sharp edges while alleviating event noise.
4.2 Results for RD
Data preparation. We use a publicly available high-speed (240 FPS) video dataset, the Need for Speed dataset [11]. We choose this dataset because it has rich motion categories and content (100 videos with 380K frames), involving both camera and scene/object motion. As introduced in Section 3.4, our RD is trained on the output of the DMR process. As a proof of concept, we simulate solving a single-frame prediction problem, i.e., given two consecutive video frames, we first simulate the latent event frame. Next, DMR is performed to predict the end frame.
Training and testing. We randomly split the dataset into 89 training classes and 11 testing classes. For augmentation purposes, we perform a random temporal flip and a spatial crop with size . The sample clip then undergoes event frame simulation and DMR using a random setting according to Table 1. Note that we constrain generated event frames to contain less than 20% of events, based on a statistical analysis of the DAVIS dataset (included in the supplementary material). We generate 100K image pairs of size pixels; 80% of the samples are randomly chosen for training and the remaining 20% are used for validation. We use a batch size of 128, which results in 2K batches per epoch. We use mini-batch stochastic gradient descent with an Adam optimizer (). The learning rate is scheduled as for the initial 30 epochs, then for the following 30 epochs and afterwards. We use an NVIDIA TITAN X GPU for parallelization. Each epoch takes approximately 6 minutes of training on our machine. We train our network for 150 epochs. Since our model is fully convolutional, the number of parameters is independent of the image size. This enables us to train on small patches () and test on the whole image.
Plug & play vs. one-time denoising. Since we train our denoiser to establish a mapping function between the DMR output and its residual towards the ground truth, the first question we investigate is how and when to use this denoiser. We compare two frameworks, i.e., plug & play [39] and one-time denoising. The plug & play framework decouples the forward physical model and the denoising prior using the ADMM technique [7]. For one-time denoising, we apply the residual denoiser once after the DMR has converged. One-time denoising is considered because it is considerably faster than plug & play. Our experimental results show that one-time denoising performs similarly to or even better than plug & play, as shown in Table 2. We reason that this is related to our training process and the initialization of the high-res tensor. Our differentiable model involves a temporal transition process from an existing frame to a future frame. We initialize the high-res tensor with the reference intensity frame. In each DMR iteration, the reconstruction process produces artifacts that are similar to the degradations in the initialized image. However, our denoiser is trained to “recognize” this degradation and remove these artifacts. Therefore, our denoiser is most useful and efficient when applied after the DMR has converged (visual results are included in the supplementary material).
Comparison with Gaussian denoisers. Since we decouple the problem into DMR and RD processes, it is interesting to see whether a general denoiser can complete this task. We select several video clips from the testing classes and compare our results with two other denoisers, DnCNN [45] and FFDNet [46]. DnCNN is an end-to-end trainable deep CNN for image denoising with different Gaussian noise levels, e.g., [0, 55]. During our testing of DnCNN we found that the pre-trained weights do not perform well. We retrained the network using the Need for Speed dataset with Gaussian noise. FFDNet is a later variant of DnCNN with the inclusion of pre- and post-processing. During our tuning of FFDNet, we found that smaller noise levels (a tunable parameter for using the model) result in better denoising performance in terms of PSNR and SSIM metrics. For each testing image, we present the best tuned FFDNet result (noise level less than 10) and compare with our proposed denoiser. The results are summarized in Table 3. Partial results with zoom-in figures are presented in Fig. 10 (full results can be seen in the supplementary material).
4.3 Comparison to non-event-based approach
We compared our results for multi-frame interpolation with a state-of-the-art approach, SepConv [31]. We present results comparing 3-frame interpolation in Fig. 11. We convert our grayscale testing images to 3 channels (RGB) before applying the SepConv interpolation algorithm. Although the results from SepConv provide a better visual experience, they have salient artifacts around large motion regions. Note that performing intensity-only frame interpolation produces significant artifacts in the presence of severe occlusions. On the other hand, our event-driven frame interpolation is able to successfully recover image details in occluded regions of interpolated frames (please see videos of results in the supplementary material). For a quantitative comparison, the SepConv method has an average SSIM of 0.9566 and PSNR of 29.79. Ours has an average SSIM of 0.9741 and PSNR of 37.64.
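For readers unfamiliar with the PSNR figures quoted above, the standard definition can be sketched as follows. This is the textbook formula, not the evaluation code used for the reported numbers; the function name `psnr` and the peak value of 1.0 (images scaled to [0, 1]) are assumptions.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01, i.e. about 20 dB.
reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)
value = psnr(reference, estimate)
```

Because the scale is logarithmic, the roughly 8 dB gap between the two averages above corresponds to a mean squared error that differs by a factor of about six.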
5 Concluding remarks
In this paper, we have introduced a novel high framerate video synthesis framework that fuses intensity frames with event streams, taking advantage of both modalities. Our framework includes two key steps, i.e., DMR and RD. Our DMR is free of training and is capable of unifying different fusion settings between the two sensing modalities, which was not considered in previous work such as [32, 36]. We show on real data that our DMR performs better than existing algorithms. We show in simulation that an RD can be trained to effectively remove artifacts from DMR. Currently we train an RD on the single-frame prediction case. It would be interesting to further augment the training samples with all the cases, which we will investigate in the future. Applying our RD to real data faces a domain gap due to the resolution (both spatial and temporal) and noise level mismatch. Currently, none of the existing DAVIS datasets contains enough sharp intensity images captured at high speed for training/fine-tuning. We will investigate event simulation using an event simulator [33] in our future work.
References
- [1] R. G. Baraniuk, T. Goldstein, A. C. Sankaranarayanan, C. Studer, A. Veeraraghavan, and M. B. Wakin. Compressive video sensing: algorithms, architectures, and applications. IEEE Signal Processing Magazine, 34(1):52–66, 2017.
- [2] P. Bardow, A. J. Davison, and S. Leutenegger. Simultaneous optical flow and intensity estimation from an event camera. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 884–892, 2016.
- [3] S. Barua, Y. Miyatani, and A. Veeraraghavan. Direct face detection and video reconstruction from event cameras. In Proc. of the Winter Conference on Applications of Computer Vision (WACV), pages 1–9, 2016.
- [4] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In Proc. of the European Conference on Computer Vision (ECCV), pages 237–252. Springer, 1992.
- [5] P. Bhat, C. L. Zitnick, N. Snavely, A. Agarwala, M. Agrawala, M. Cohen, B. Curless, and S. B. Kang. Using photographs to enhance videos of a static scene. In Proc. of the 18th Eurographics conference on Rendering Techniques, pages 327–338. Eurographics Association, 2007.
- [6] P. Bhattacharjee and S. Das. Temporal coherency based criteria for predicting video frames using deep multi-stage generative adversarial networks. In Advances in Neural Information Processing Systems, pages 4271–4280, 2017.
- [7] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
- [8] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck. A 240 × 180 130 dB 3 μs latency global shutter spatiotemporal vision sensor. Journal of Solid-State Circuits, 49(10):2333–2341, 2014.
