Cross-modal Supervision for Learning Active Speaker Detection in Video