AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization