AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning