Modulo Video Recovery via Selective Spatiotemporal Vision Transformer
Tianyu Geng, Feng Ji, Wee Peng Tay

TL;DR
This paper introduces SSViT, a novel deep learning framework using a selective spatiotemporal transformer for high-quality modulo video reconstruction, outperforming previous methods in efficiency and accuracy.
Contribution
We develop the first deep learning-based modulo video recovery method employing a selective spatiotemporal transformer architecture.
Findings
SSViT achieves state-of-the-art performance in modulo video reconstruction.
The token selection strategy improves computational efficiency and concentrates attention on a selected subset of tokens.
High-quality reconstructions are possible from 8-bit folded videos.
Abstract
Conventional image sensors have limited dynamic range, causing saturation in high-dynamic-range (HDR) scenes. Modulo cameras address this by folding incident irradiance into a bounded range, yet require specialized unwrapping algorithms to reconstruct the underlying signal. Unlike HDR recovery, which extends dynamic range from conventional sampling, modulo recovery restores actual values from folded samples. Despite being introduced over a decade ago, progress in modulo image recovery has been slow, especially in the use of modern deep learning techniques. In this work, we demonstrate that standard HDR methods are unsuitable for modulo recovery. Transformers, however, can capture global dependencies and spatial-temporal relationships crucial for resolving folded video frames. Still, adapting existing Transformer architectures for modulo recovery demands novel techniques. To this end, we…
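The folding operation described in the abstract can be illustrated with a minimal sketch. It assumes non-negative integer irradiance values and an 8-bit sensor, so values wrap with period 2^8 = 256. The function names `fold` and `unfold` and the sample scanline are hypothetical and chosen for illustration only. This is not the SSViT method; it shows only the forward model and why recovery is hard.

```python
def fold(irradiance, bits=8):
    """Wrap non-negative integer irradiance values into [0, 2**bits)."""
    period = 2 ** bits
    return [value % period for value in irradiance]


def unfold(folded, rollovers, bits=8):
    """Invert fold() given per-sample rollover counts.

    In a real modulo camera the rollover counts are unknown;
    estimating them is the recovery problem itself.
    """
    period = 2 ** bits
    return [m + k * period for m, k in zip(folded, rollovers)]


# Hypothetical scanline that crosses a bright region (assumed values).
scanline = [100, 250, 300, 700, 1023, 90]

folded = fold(scanline)

# Ground-truth rollover counts, available here only because we
# synthesized the data ourselves.
rollovers = [value // 256 for value in scanline]

recovered = unfold(folded, rollovers)

print(folded)
print(rollovers)
print(recovered == scanline)
```

The folded samples alone are ambiguous: a reading of 44 could correspond to 44, 300, 556, and so on. A recovery method has to infer the rollover count for each pixel from spatial and temporal context, which is where the abstract argues that global dependencies matter.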
Taxonomy
Topics: Image Enhancement Techniques · CCD and CMOS Imaging Sensors · Advanced Image Processing Techniques
