AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning
Jongsuk Kim, Jiwon Shin, Junmo Kim

TL;DR
AVCap is an audio-visual captioning framework that treats audio-visual features as text tokens, which the authors report improves captioning performance as well as the model's extensibility and scalability.
Contribution
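The paper presents AVCap, a simple baseline for audio-visual captioning that uses audio-visual features as text tokens. It studies three design dimensions: the audio-visual encoder architecture, the adaptation of pre-trained models to the characteristics of the generated text, and the effect of modality fusion on captioning.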
Findings
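AVCap outperforms existing audio-visual captioning methods across all metrics.
Using audio-visual features as text tokens benefits the model's extensibility and scalability as well as its performance.
The framework builds on pre-trained models and examines how much modality fusion helps captioning.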
Abstract
In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an Audio-Visual Captioning framework, a simple yet powerful baseline approach applicable to audio-visual captioning. AVCap utilizes audio-visual features as text tokens, which has many advantages not only in performance but also in the extensibility and scalability of the model. AVCap is designed around three pivotal dimensions: the exploration of optimal audio-visual encoder architectures, the adaptation of pre-trained models according to the characteristics of generated text, and the investigation into the efficacy of modality fusion in captioning. Our method outperforms existing audio-visual captioning methods across all metrics and the code is…
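To make the phrase "audio-visual features as text tokens" more concrete, the sketch below shows one common way such a scheme can be arranged: encoder features are projected to the text embedding width, placed in front of the caption token embeddings, and combined with an attention mask that lets caption tokens see the whole audio-visual prefix but only earlier caption tokens. This is a generic prefix-style illustration, not the authors' implementation. The function names, the dimensions, the random weights, and the mask layout are all assumptions made for the example.

```python
import random

def project(features, weight):
    """Linearly map each feature vector (d_in) to the text embedding width (d_model)."""
    d_out = len(weight[0])
    return [[sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_out)]
            for f in features]

def build_sequence(audio_feats, visual_feats, text_embeds, w_audio, w_visual):
    """Place projected audio and visual features in front of the caption embeddings."""
    prefix = project(audio_feats, w_audio) + project(visual_feats, w_visual)
    return prefix + text_embeds, len(prefix)

def prefix_causal_mask(n_prefix, n_text):
    """mask[i][j] is True when position i may attend to position j (assumed layout)."""
    n = n_prefix + n_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_prefix:
                mask[i][j] = True   # every position sees the audio-visual prefix
            elif i >= n_prefix and j <= i:
                mask[i][j] = True   # caption tokens see themselves and earlier caption tokens
    return mask

# Hypothetical toy sizes and random values, chosen only for illustration.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

d_audio, d_visual, d_model = 6, 8, 4
audio_feats = rand_matrix(3, d_audio)    # 3 audio feature vectors
visual_feats = rand_matrix(2, d_visual)  # 2 visual feature vectors
text_embeds = rand_matrix(5, d_model)    # 5 caption token embeddings
w_audio = rand_matrix(d_audio, d_model)
w_visual = rand_matrix(d_visual, d_model)

sequence, n_prefix = build_sequence(audio_feats, visual_feats, text_embeds, w_audio, w_visual)
mask = prefix_causal_mask(n_prefix, len(text_embeds))
print(len(sequence), n_prefix)
```

Under this arrangement the text decoder needs no separate cross-attention path for the audio-visual input, which is one plausible reading of why the abstract links the design to extensibility: adding a modality only lengthens the prefix.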
Taxonomy
Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Video Analysis and Summarization
