Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels