Space-Time Attention with Shifted Non-Local Search
Kent Gauen, Stanley Chan

TL;DR
This paper introduces Shifted Non-Local Search, a memory-efficient and faster method for computing attention in videos that improves alignment and denoising quality by correcting small spatial errors in predicted motion offsets.
Contribution
It proposes a search strategy that combines the quality of a non-local search with the range of predicted offsets, making video attention modules both more accurate and more efficient.
Findings
Improves video frame alignment by over 3 dB PSNR
Reduces memory usage by 10x and runs over 3x faster
Enhances video denoising results by 0.30 dB PSNR
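To put the dB figures above in context, here is a minimal sketch assuming the standard definition of PSNR, 10·log10(MAX² / MSE). The function names and the default peak value of 1.0 are illustrative assumptions, not taken from the paper.

```python
import math


def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error.

    max_val is the assumed peak signal value (1.0 for images scaled to [0, 1]).
    """
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(db_gain):
    """Factor by which the MSE must shrink to gain `db_gain` dB of PSNR."""
    return 10.0 ** (db_gain / 10.0)


# A 3 dB gain corresponds to roughly halving the MSE,
# while a 0.30 dB gain corresponds to an MSE reduction of roughly 7%.
ratio_alignment = mse_ratio_for_gain(3.0)
ratio_denoising = mse_ratio_for_gain(0.30)
```

Under this definition the gain depends only on the ratio of the two MSE values, so the choice of peak value does not affect the comparison.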
Abstract
Efficiently computing attention maps for videos is challenging due to the motion of objects between frames. While a standard non-local search is high-quality for a window surrounding each query point, the window's small size cannot accommodate motion. Methods for long-range motion use an auxiliary network to predict the most similar key coordinates as offsets from each query location. However, accurately predicting this flow field of offsets remains challenging, even for large-scale networks. Small spatial inaccuracies significantly impact the attention module's quality. This paper proposes a search strategy that combines the quality of a non-local search with the range of predicted offsets. The method, named Shifted Non-Local Search, executes a small grid search surrounding the predicted offsets to correct small spatial errors. Our method's in-place computation consumes 10 times less…
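The core idea described in the abstract, a small grid search centered on each predicted offset, can be sketched roughly as follows. This is a simplified NumPy illustration under several assumptions: grayscale frames, a sum-of-squared-differences patch distance, edge padding at the borders, and a single best match per query rather than the attention weights a full module would compute. The names `shifted_nls`, `ws`, and `ps` are hypothetical, and the plain loops do not reproduce the paper's in-place implementation.

```python
import numpy as np


def shifted_nls(query, key, flow, ws=3, ps=3):
    """Refine predicted offsets with a small grid search around each one.

    query, key: (H, W) float arrays (two video frames).
    flow: (H, W, 2) array of predicted (row, col) offsets, may be fractional.
    ws: search window width; ps: patch width (both assumed odd).
    Returns an (H, W, 2) integer array of corrected offsets.
    """
    H, W = query.shape
    r, p = ws // 2, ps // 2
    qpad = np.pad(query, p, mode="edge")
    kpad = np.pad(key, p, mode="edge")
    refined = np.zeros((H, W, 2), dtype=np.int64)
    for i in range(H):
        for j in range(W):
            # Patch centered at the query location (i, j).
            qpatch = qpad[i:i + ps, j:j + ps]
            # Shift the window center by the predicted offset, rounded
            # to the integer pixel grid.
            ci = int(round(i + flow[i, j, 0]))
            cj = int(round(j + flow[i, j, 1]))
            best, best_pos = np.inf, (i, j)
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    # Clamp candidate locations to the frame.
                    ki = min(max(ci + di, 0), H - 1)
                    kj = min(max(cj + dj, 0), W - 1)
                    kpatch = kpad[ki:ki + ps, kj:kj + ps]
                    dist = float(np.sum((qpatch - kpatch) ** 2))
                    if dist < best:
                        best, best_pos = dist, (ki, kj)
            refined[i, j] = (best_pos[0] - i, best_pos[1] - j)
    return refined
```

With `ws=3`, a predicted offset that is wrong by one pixel in each direction still falls inside the search window, so the grid search can recover the true displacement without enlarging the window to cover the full motion range.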
Peer Reviews
Decision · Submitted to ICLR 2024
1) Non-global image matching/search is a hard problem. Applications related to video analysis (object tracking, denoising) are important. 2) The proposed approach (first predicting an offset, then refining the estimate in a local search window) is intuitive. The merits of the approach are shown experimentally (frame alignment, space-time attention for video denoising).
1) The technical explanations of the implementation of the approach are difficult to follow. Sections 3.1 and 3.2 could certainly be clarified and simplified. For example, specify the meaning of the indices and use different letters for different variables (what is the difference between I and \tilde{I}? If K_v is the variable for the keys, then do not use K again to denote the number of neighbors, etc.). 2) The results section is not clear to me. I suggest that the authors start the experiments section
1. The authors demonstrate that optical flow requires only minor spatial corrections for frame-wise alignment. 2. The authors introduce In-Place Computation, which significantly reduces the memory working set and consequently enhances speed. 3. The proposed method achieves state-of-the-art results on the video denoising task.
1. The authors show that optical flow only needs small spatial corrections using results from the Sintel-Clean benchmark; however, this setting is far from real-world data, where blur and degradation can occur. Moreover, these results are from methods with high computational cost, which is not feasible for the online setting. 2. The idea has already been explored in video enhancement tasks, such as BasicVSR++ [1] and RVRT [2], where the deformable convolutions/attentions' offsets are compute
This approach is somewhat on the hardware side and is thus very advantageous in terms of speed and memory consumption over other methods. The method is implemented "in-place" (whatever that means, as no details are disclosed), so its lower memory consumption is very attractive compared to recent memory-hungry large models. Denoising performance is better than that of other methods, as shown in Tables 1 and 2, and Table 2 shows that the proposed method offers a good trade-off between computation time and GPU memory.
Patch-based offset correction: as far as I can tell from Section 3.1, similarities for the search are computed between "reference locations" and "search locations", depending on the strides S_Q and S_K. Given the predicted offset F, which has floating-point coordinates, the corrected coordinates reside on the integer grid. In the experiments the strides were set to 2 (probably, as it is not shown); however, this is slower, as shown in Table 10. There are no experiments on denoising and alignment with stride 2, it is difficult
Taxonomy
Topics: Advanced Vision and Imaging · Photoacoustic and Ultrasonic Imaging · Image Processing Techniques and Applications
