Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection