Multi-modal Collaborative Optimization and Expansion Network for Event-assisted Single-eye Expression Recognition
Runduo Han, Xiuping Liu, Shangxuan Yi, Yi Zhang, Hongchen Tan

TL;DR
This paper introduces a novel multi-modal network that leverages event data to improve single-eye expression recognition, especially in challenging lighting conditions, by integrating innovative optimization and expert collaboration techniques.
Contribution
The paper presents a new multi-modal network with two key components, MCO-Mamba and HCE-MoE, enabling effective fusion and collaboration of modalities for expression recognition.
Findings
Achieves competitive accuracy in low-light conditions
Effectively fuses multi-modal information for better recognition
Demonstrates robustness against challenging lighting scenarios
Abstract
In this paper, we propose a Multi-modal Collaborative Optimization and Expansion Network (MCO-E Net) that uses the event modality to counter challenges such as low light, high exposure, and high dynamic range in single-eye expression recognition. The MCO-E Net introduces two innovative designs: Multi-modal Collaborative Optimization Mamba (MCO-Mamba) and Heterogeneous Collaborative and Expansion Mixture-of-Experts (HCE-MoE). MCO-Mamba, building upon Mamba, leverages dual-modal information to jointly optimize the model, facilitating collaborative interaction and fusion of modal semantics. This approach encourages the model to balance the learning of both modalities and harness their respective strengths. HCE-MoE, on the other hand, employs a dynamic routing mechanism to distribute structurally varied experts (deep, attention, and focal), fostering collaborative learning of…
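The dynamic routing idea mentioned in the abstract can be illustrated with a generic top-k mixture-of-experts sketch. This is not the authors' HCE-MoE: the function names (`top_k_route`, `moe_forward`), the toy experts, and the softmax-then-renormalize gating are all assumptions chosen for illustration, and the three lambdas are only stand-ins for structurally different experts.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-probability experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(
    x: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    gate_logits: Sequence[float],
    k: int = 2,
) -> Vector:
    """Weighted sum of the outputs of the k selected experts (unselected experts are skipped)."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Hypothetical stand-ins for structurally different experts (assumption, not the paper's modules).
toy_experts = [
    lambda x: [2.0 * v for v in x],  # placeholder for a "deep" expert
    lambda x: [v + 1.0 for v in x],  # placeholder for an "attention" expert
    lambda x: [-v for v in x],       # placeholder for a "focal" expert
]

# In a real model the gate logits would come from a learned router; here they are fixed by hand.
fused = moe_forward([1.0, 2.0], toy_experts, gate_logits=[0.0, 0.0, -100.0], k=2)
```

With these hand-set logits the third expert receives almost no probability mass, so only the first two experts contribute to `fused`, each with weight 0.5.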
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The MCO-E Net achieves competitive performance on the single-eye expression recognition task while attaining the fastest inference speed as shown in Section 4.1 and A.3.
1. It would be helpful to clarify the necessity and significance of incorporating event-based data specifically for the eye-related facial expression recognition task. Additionally, it should be explicitly stated which aspects of the method are specifically designed to cater to the challenges of eye-related facial expression recognition. 2. The proposed network structure lacks significant innovation. For instance, in the introduction of MCO-Mamba, MJOS derives the BC parameters of SSM by concat
1) The use of event cameras for illumination-robust eye expression recognition is motivated by practical considerations. 2) The paper includes component and hyperparameter ablations. 3) Despite model complexity, efficiency analysis shows reasonable speed.
1) The experiments were conducted on two corpora only (SEE and DSEE). The method's practical applicability is called into question by the lack of verification on independent, in-the-wild corpora. 2) No experiment tests the transferability of the model (e.g. training on SEE and testing on DSEE, or on a third-party corpus). The latter is more important. 3) The MJOS proposal essentially involves jointly projecting parameters from two modalities, followed by residual addition. This is technically si
1. The combination of Mamba and heterogeneous MoE for multimodal feature fusion is creative and technically interesting. 2. Experimental results are strong, showing consistent improvements over state-of-the-art methods across all lighting conditions with very low inference latency. 3. The motivation is clear and addresses a real challenge of semantic misalignment between RGB and event modalities. 4. Ablation studies are comprehensive, verifying the contribution of each component and parameter ch
1. Although the model is complex from an engineering perspective, there is limited theoretical analysis of the convergence and dynamic equilibrium mechanisms of the MCO-Mamba joint optimization. 2. Mamba-based multimodal fusion shares similarities with recent approaches (e.g., Sigma, MSFMamba, DepMamba, 2024–2025), where the innovations primarily arise from module combinations rather than fundamentally new principles. 3. The validation is limited to two similar monocular datasets, without cross-
1. Experiments: Experiments on the SEE and DSEE datasets demonstrate that MCO-E Net outperforms existing methods by 2–4% in both WAR and UAR, showing greater robustness under different illumination settings. Ablation studies further verify the individual contributions of MCO-Mamba and HCE-MoE, as well as the influence of expert number and Top-k routing. 2. Architectural design: The two complementary modules (MCO-Mamba for bidirectional SSM-based fusion; HCE-MoE for heterogeneous expert routing wi
1. Generalization scope: Evidence is limited to SEE/DSEE; cross-dataset, cross-subject, and cross-device evaluations (different event sensors/cameras) are not demonstrated. 2. Clarity of components: Some module definitions (e.g., gating reductions, symbol shapes, interaction operators) need explicit dimensionality and implementation details to ensure exact replication. 3. Ablation coverage: Missing or light on key controls such as RGB-only vs Event-only, bidirectional vs unidirectional SSM, route
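The reviews above cite WAR and UAR without defining them. As a minimal sketch, assuming the usual definitions (WAR as recall weighted by class support, which equals overall accuracy, and UAR as the plain mean of per-class recalls), the two metrics can be computed as follows. The function name `war_uar` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def war_uar(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (WAR, UAR) under the assumed standard definitions.

    WAR: per-class recall weighted by class support, i.e. overall accuracy.
    UAR: unweighted mean of per-class recalls over classes present in y_true.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # hits per class
    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / n for c, n in support.items()) / len(support)
    return war, uar


# Toy, made-up labels: an imbalanced set where the minority class is always missed.
labels = ["happy"] * 4 + ["sad"]
preds = ["happy"] * 5
war, uar = war_uar(labels, preds)
```

On this toy input WAR is 0.8 while UAR is 0.5, which shows why both are usually reported together: UAR penalizes a model that ignores a minority class even when overall accuracy looks high.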
Taxonomy
Topics: Emotion and Mood Recognition · Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
