TL;DR
FaceLiVTv2 is a hybrid CNN-Transformer architecture for mobile face recognition that uses lightweight global-local feature interaction modules to improve both accuracy and efficiency, reducing mobile inference latency by 22% compared to FaceLiVTv1.
Contribution
The paper introduces Lite MHLA, a lightweight global token interaction module, and a unified RepMix block, advancing hybrid CNN-Transformer models for mobile face recognition with better recognition performance and lower latency.
Findings
Reduces mobile inference latency by 22% compared to FaceLiVTv1.
Achieves up to 30.8% speedup over GhostFaceNets on mobile devices.
Maintains higher recognition accuracy while reducing latency by 20-41%.
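The findings above mix "latency reduction" and "speedup" figures, and the summary does not say which convention each number follows. As a small aid for comparing them, the sketch below converts between the two under the usual assumption that speedup is old latency divided by new latency. The function names are hypothetical and not taken from the paper.

```python
def latency_reduction_to_speedup(reduction: float) -> float:
    """Convert a fractional latency reduction (0.22 means 22%) to a speedup factor.

    Assumes speedup = old_latency / new_latency.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_to_latency_reduction(speedup: float) -> float:
    """Convert a speedup factor (1.308 means 30.8% faster) to a fractional latency reduction."""
    if speedup < 1.0:
        raise ValueError("speedup must be >= 1")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # 22% lower latency, read as a throughput-style speedup factor
    print(f"{latency_reduction_to_speedup(0.22):.3f}x")
    # 30.8% speedup, read as a fractional latency reduction
    print(f"{speedup_to_latency_reduction(1.308):.1%}")
```

Under this convention, a 22% latency reduction corresponds to roughly a 1.28x speedup, and a 30.8% speedup corresponds to roughly a 23.5% latency reduction.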
Abstract
Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN-Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global-local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate…
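The abstract describes Lite MHLA only at a high level, so the NumPy sketch below is an illustration under stated assumptions, not the paper's implementation. It assumes that a "multi-head linear token projection" splits the channels into heads and mixes tokens with one learned N x N matrix per head, and that the "affine rescale" is a per-channel scale and shift. The function name, argument names, and shapes are all hypothetical.

```python
import numpy as np


def lite_mhla_sketch(x, token_proj, scale, shift):
    """Hypothetical multi-head linear token mixing followed by an affine rescale.

    x:          (N, C) token features, N tokens with C channels
    token_proj: (H, N, N) one assumed token-mixing matrix per head
    scale:      (C,) assumed per-channel scale
    shift:      (C,) assumed per-channel shift
    """
    n_tokens, channels = x.shape
    heads = token_proj.shape[0]
    if channels % heads != 0:
        raise ValueError("channels must be divisible by the number of heads")
    head_dim = channels // heads

    # Split channels into heads: (N, C) -> (H, N, D)
    xh = x.reshape(n_tokens, heads, head_dim).transpose(1, 0, 2)

    # Each head mixes information across tokens with its own linear map:
    # (H, N, N) @ (H, N, D) -> (H, N, D)
    mixed = token_proj @ xh

    # Merge heads back: (H, N, D) -> (N, C)
    merged = mixed.transpose(1, 0, 2).reshape(n_tokens, channels)

    # Per-channel affine rescale
    return merged * scale + shift


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((49, 64))              # e.g. a 7x7 feature map, 64 channels
    proj = rng.standard_normal((4, 49, 49)) / 49   # 4 heads, random mixing weights
    y = lite_mhla_sketch(x, proj, np.ones(64), np.zeros(64))
    print(y.shape)
```

In this reading, token mixing needs no query-key similarity computation: the cost is one fixed matrix product per head, which is the kind of saving a linear token projection would offer over softmax attention.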
