TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing

Yaru Chen; Peiliang Zhang; Fei Li; Faegheh Sardari; Ruohao Guo; Zhenbo Li; Wenwu Wang

arXiv:2505.02096 · cs.MM · May 6, 2025

TL;DR

TeMTG introduces a multimodal framework that enhances audio-visual video parsing by integrating text embeddings and multi-hop temporal graph modeling, leading to improved event detection accuracy under weak supervision.

Contribution

The paper proposes a novel multimodal optimization framework combining text enhancement with multi-hop temporal graph neural networks for better AVVP performance.

Findings

01. Achieves state-of-the-art results on the LLP dataset.

02. Effectively models temporal relationships between segments.

03. Enhances semantic feature representations with text embeddings.

Abstract

The Audio-Visual Video Parsing (AVVP) task aims to identify the event categories and their occurrence times in the audio and visual modalities of a given video. Existing methods usually model audio and visual features implicitly through weak labels, without mining semantic relationships across modalities or explicitly modeling the temporal dependencies between events. This makes it difficult for the model to accurately parse event information for each segment under weak supervision, especially when high similarity between segment-level modality features leads to ambiguous event boundaries. Hence, we propose a multimodal optimization framework, TeMTG, that combines text enhancement and multi-hop temporal graph modeling. Specifically, we leverage pre-trained multimodal models to generate modality-specific text embeddings, and fuse them with audio-visual features to enhance the semantic representation…
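
The abstract is truncated before it describes how the temporal graph is built, so the following is only an illustrative sketch of what "multi-hop" temporal modeling over video segments could look like, not the authors' method. It assumes each segment is a graph node, that two segments are connected when they are at most `max_hops` time steps apart, and that one propagation step is a plain neighbourhood mean. The function names `multi_hop_adjacency` and `propagate` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def multi_hop_adjacency(num_segments: int, max_hops: int) -> Matrix:
    """Build a row-normalised adjacency matrix over temporal segments.

    Assumption (not from the paper): segment i is linked to every
    segment j with |i - j| <= max_hops, including itself.
    """
    adjacency: Matrix = []
    for i in range(num_segments):
        row = [1.0 if abs(i - j) <= max_hops else 0.0
               for j in range(num_segments)]
        total = sum(row)  # at least 1.0 because of the self-loop
        adjacency.append([value / total for value in row])
    return adjacency


def propagate(features: Matrix, adjacency: Matrix) -> Matrix:
    """One message-passing step: each segment becomes the weighted
    mean of the features of the segments it is connected to."""
    num_segments = len(features)
    dim = len(features[0])
    return [
        [sum(adjacency[i][j] * features[j][d] for j in range(num_segments))
         for d in range(dim)]
        for i in range(num_segments)
    ]


if __name__ == "__main__":
    # Four segments with one-dimensional toy features.
    segment_features = [[1.0], [3.0], [5.0], [7.0]]
    adj = multi_hop_adjacency(num_segments=4, max_hops=1)
    print(propagate(segment_features, adj))
```

Raising `max_hops` widens the temporal context each segment sees in a single step; a real model would typically learn the edge weights rather than use a fixed mean.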

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Cancer-related molecular mechanisms research