CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content
Gyuwon Han, Young Kyun Jang, Chanho Eom

TL;DR
This paper introduces CoVA, a new task for retrieving a target video from a reference video and a text query that specifies visual and audio modifications, addressing the limitation that existing composed video retrieval benchmarks consider only visual changes.
Contribution
It proposes AVT Compositional Fusion (AVT), a fusion method that integrates video, audio, and text features, and AV-Comp, a benchmark of video pairs with cross-modal changes and textual queries describing the differences.
Findings
AVT outperforms unimodal fusion baselines.
AV-Comp enables evaluation of audio-visual retrieval.
The dataset includes cross-modal change queries.
Abstract
Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite visual similarity. To address this limitation, we introduce Composed retrieval for Video with its Audio (CoVA), a new retrieval task that accounts for both visual and auditory variations. To support this, we construct AV-Comp, a benchmark consisting of video pairs with cross-modal changes and corresponding textual queries that describe the differences. We also propose AVT Compositional Fusion (AVT), which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for CoVA. Examples from the proposed…
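To make the retrieval setup concrete, here is a minimal sketch of text-guided fusion followed by gallery ranking. It is not the authors' AVT implementation: the gating rule (a softmax over text-to-modality cosine similarities), the `temperature` parameter, the function names `fuse_query` and `rank_gallery`, and the choice to represent each gallery item as the sum of its video and audio embeddings are all assumptions made for illustration. It assumes video, audio, and text embeddings already live in a shared space.

```python
import math

def normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def softmax(logits):
    # Numerically stable softmax over a short list of scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_query(ref_video, ref_audio, text, temperature=0.1):
    # Hypothetical gating (an assumption, not the paper's method):
    # weight each reference modality by how well the text aligns with it,
    # then add the text embedding to form the composed query.
    sims = [cosine(text, ref_video), cosine(text, ref_audio)]
    w_video, w_audio = softmax([s / temperature for s in sims])
    composed = [
        w_video * v + w_audio * a + t
        for v, a, t in zip(ref_video, ref_audio, text)
    ]
    return normalize(composed), (w_video, w_audio)

def rank_gallery(query, gallery):
    # gallery maps a video id to a (video_embedding, audio_embedding) pair.
    # Each item is scored by cosine similarity between the composed query
    # and the sum of the item's video and audio embeddings.
    scored = []
    for video_id, (video_emb, audio_emb) in gallery.items():
        target = [v + a for v, a in zip(video_emb, audio_emb)]
        scored.append((video_id, cosine(query, target)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A toy call would pass three-dimensional vectors such as `fuse_query([1, 0, 0], [0, 1, 0], [0, 0.2, 1])` and then hand the composed query to `rank_gallery`. In practice the embeddings would come from pretrained video, audio, and text encoders, which this sketch leaves out.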
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Music and Audio Processing
