AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

Jongsuk Kim; Jiwon Shin; Junmo Kim

arXiv:2407.07801·eess.AS·July 12, 2024

TL;DR

AVCap is an audio-visual captioning framework that treats audio-visual features as text tokens, which the authors report improves captioning performance as well as the extensibility and scalability of the model.

Contribution

The paper presents AVCap, a new baseline approach that leverages audio-visual features as text tokens, exploring encoder architectures, pre-trained model adaptation, and modality fusion.

Findings

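01. AVCap outperforms existing audio-visual captioning methods across all metrics.

02. Using audio-visual features as text tokens improves the extensibility and scalability of the captioning model.

03. AVCap adapts pre-trained models to the characteristics of the generated text and makes effective use of modality fusion.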

Abstract

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an Audio-Visual Captioning framework, a simple yet powerful baseline approach applicable to audio-visual captioning. AVCap utilizes audio-visual features as text tokens, which has many advantages not only in performance but also in the extensibility and scalability of the model. AVCap is designed around three pivotal dimensions: the exploration of optimal audio-visual encoder architectures, the adaptation of pre-trained models according to the characteristics of generated text, and the investigation into the efficacy of modality fusion in captioning. Our method outperforms existing audio-visual captioning methods across all metrics and the code is…
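The idea of using audio-visual features as text tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify layer types, dimensions, or masking, so the helper names (`as_text_tokens`, `build_sequence`, `prefix_causal_mask`), the linear projections, the feature sizes, and the prefix-style attention mask are all assumptions made for illustration. The sketch projects audio and visual features to the text embedding width, prepends them to the caption token embeddings, and builds a mask in which caption tokens see the whole audio-visual prefix but only earlier caption tokens.

```python
import random


def linear(x, weight):
    """Project one feature vector x (length d_in) with weight (d_out rows of length d_in)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def as_text_tokens(features, weight):
    """Map each modality feature vector into the text embedding space (hypothetical helper)."""
    return [linear(f, weight) for f in features]


def build_sequence(audio_feats, visual_feats, text_embeds, w_audio, w_visual):
    """Concatenate projected audio and visual tokens in front of the caption embeddings."""
    prefix = as_text_tokens(audio_feats, w_audio) + as_text_tokens(visual_feats, w_visual)
    return prefix + text_embeds, len(prefix)


def prefix_causal_mask(n_prefix, n_text):
    """mask[i][j] is True when position i may attend to position j.

    Assumed scheme: every position sees the full audio-visual prefix,
    and caption positions additionally see earlier caption positions.
    """
    n = n_prefix + n_text
    return [[j < n_prefix or j <= i for j in range(n)] for i in range(n)]


def random_matrix(rng, rows, cols):
    """Random values standing in for learned weights or encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only for the example.
rng = random.Random(0)
d_audio, d_visual, d_text = 4, 6, 8
w_audio = random_matrix(rng, d_text, d_audio)
w_visual = random_matrix(rng, d_text, d_visual)

audio_feats = random_matrix(rng, 3, d_audio)    # 3 audio tokens
visual_feats = random_matrix(rng, 2, d_visual)  # 2 visual tokens
text_embeds = random_matrix(rng, 4, d_text)     # 4 caption token embeddings

sequence, n_prefix = build_sequence(audio_feats, visual_feats, text_embeds, w_audio, w_visual)
mask = prefix_causal_mask(n_prefix, len(text_embeds))
print(len(sequence), n_prefix)
```

Because the modality features end up in the same sequence as ordinary token embeddings, adding another modality in this sketch only requires one more projection and a longer prefix, which is one way to read the extensibility claim in the abstract.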

Code & Models

Repositories

jongsuk1/avcap (PyTorch, official)

Taxonomy

Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Video Analysis and Summarization