Lightweight Quad Bayer HybridEVS Demosaicing via State Space Augmented Cross-Attention
Shiyang Zhou, Haijin Zeng, Yunfan Lu, Yongyong Chen, Jie Liu, Jingyong Su

TL;DR
This paper introduces TSANet, a lightweight two-stage neural network with state space augmented cross-attention for efficient and high-quality demosaicing of HybridEVS event camera data, outperforming previous methods.
Contribution
The paper proposes a novel lightweight two-stage network with state space augmented cross-attention for event-based demosaicing, improving accuracy and efficiency on mobile devices.
Findings
Outperforms DemosaicFormer in PSNR and SSIM across seven datasets.
Reduces parameter count by 1.86× and computation cost by 3.29× relative to DemosaicFormer.
Demonstrates effective demosaicing on both simulated and real HybridEVS data.
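For readers unfamiliar with the PSNR metric named in the findings, the following is a minimal sketch of its standard definition, 10·log10(peak² / MSE). It assumes images normalised to [0, 1]; the function name `psnr` and the `peak` parameter are illustrative and not taken from the paper, whose exact evaluation protocol is not given here.

```python
import numpy as np


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    `peak` is the maximum possible pixel value (assumed 1.0 here).
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Higher values indicate a reconstruction closer to the reference; a uniform error of 0.1 on a [0, 1] image corresponds to 20 dB.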
Abstract
Event cameras such as the Hybrid Event-based Vision Sensor (HybridEVS) capture brightness changes as asynchronous "events" instead of frames, enabling advanced applications in mobile photography. However, challenges arise from combining a Quad Bayer Color Filter Array (CFA) sensor with event pixels that lack color information, which results in aliasing and artifacts in the demosaicing process before downstream applications. Current methods struggle to address these issues, especially on resource-limited mobile devices. In response, we introduce TSANet, a lightweight Two-stage network via State space augmented cross-Attention, which handles event-pixel inpainting and demosaicing separately, leveraging the benefits of dividing complex tasks into manageable subtasks. Furthermore, we introduce a lightweight Cross-Swin State Block that uniquely utilizes…
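To make the input problem concrete, the sketch below samples an RGB image through a Quad Bayer CFA and blanks out event-pixel positions, which is the kind of raw frame the abstract describes. It is an illustration under stated assumptions, not the authors' data pipeline: the RGGB tile orientation is assumed, and `event_mask` is a hypothetical boolean array, since the actual HybridEVS event-pixel layout is not given in this text.

```python
import numpy as np

# Quad Bayer tile: each Bayer colour occupies a 2x2 block.
# Channel indices: 0 = R, 1 = G, 2 = B (RGGB orientation is an assumption).
QUAD_BAYER_TILE = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
])


def quad_bayer_mosaic(rgb, event_mask=None):
    """Sample an H x W x 3 image through a Quad Bayer CFA.

    `event_mask` is a hypothetical boolean H x W array marking event pixels;
    those positions carry no colour sample and are set to 0 here.
    Returns the single-channel raw frame and the per-pixel channel map.
    """
    h, w, _ = rgb.shape
    reps = (-(-h // 4), -(-w // 4))  # ceiling division so the tiling covers the image
    channel = np.tile(QUAD_BAYER_TILE, reps)[:h, :w]
    rows, cols = np.indices((h, w))
    raw = rgb[rows, cols, channel]  # keep one colour sample per pixel
    if event_mask is not None:
        raw = np.where(event_mask, 0, raw)
    return raw, channel
```

A two-stage method in the spirit of the abstract would first inpaint the zeroed event positions and then demosaic the completed Quad Bayer frame back to three channels.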
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
• The proposed RVSS module demonstrates effectiveness by significantly reducing model parameters while maintaining competitive performance levels. This efficiency can be particularly advantageous for resource-constrained environments, enabling the deployment of complex vision models on devices with limited computational power. The parameter reduction, achieved without notable sacrifices in accuracy or quality, highlights the RVSS module’s potential for scalability and its suitability for lightwe
• The paper's relevance to event-based vision is unclear, as it lacks components specifically tailored for processing event signals. There is no dedicated mechanism or module designed to leverage the unique properties of event-based input. This raises questions about the paper’s contributions to event-based vision specifically.
• The paper does not detail the loss function or the two-stage training strategy, both of which are crucial for understanding the network’s optimization and performance
S1. Research on lightweight hybrid event camera demosaicing architectures holds significant potential for advancing the field of event cameras. In the experiments, the proposed TSANet-s markedly outperforms the SOTA while maintaining the lowest parameter count and complexity.
S2. Integrating SSM with window attention is an effective approach, as it substantially reduces model complexity while balancing both global and local information.
S3. The authors incorporated
W1: The combination of state-space models with attention does not appear sufficiently novel. Additionally, the effects of the proposed QCSA and SPA, as shown in the ablation study, are minimal.
W2: The paper lacks an in-depth discussion of integrating the Quad Bayer pattern's positional information. This aspect should be one of the primary focuses.
W3: On Page 4, Line 215, the paper states that previous studies have shown that pretraining sub-networks can improve performance and inference stability, yet there is no citation
* The proposed two-stage network design seems effective at handling this kind of data.
* The proposed method requires fewer parameters, which could make it efficient to deploy on resource-limited mobile devices.
* The motivation for the two-stage network design is unclear. At Line 200, the authors state that "all-in-one models often struggle to extract the inner connection between position and color", but provide no explanation or proof; readers are only pointed to the experimental results in Fig. 6. This reads more like story-telling (e.g., "in our experiments we found that doing xx could be better than doing yy") instead of giving in-depth analysis on
The paper explores a new and important research topic: demosaicing for Hybrid Event-based Vision Sensors (HybridEVS). This is a valuable area of study given the increasing interest in event vision (MIPI Demosaic 2024). The proposed TSANet introduces a lightweight network, which has the potential to be applied on mobile devices. However, the authors have not conducted experiments to validate its performance on edge computing (challenges iii).
Please refer to Summary
Taxonomy
Topics: Blind Source Separation Techniques · Image and Signal Denoising Methods · Industrial Vision Systems and Defect Detection
