Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder