Grounding is All You Need? Dual Temporal Grounding for Video Dialog
You Qin, Wei Ji, Xinze Lan, Hao Fei, Xun Yang, Dan Guo, Roger Zimmermann, Lizi Liao

TL;DR
This paper introduces DTGVD, a novel video dialog model that integrates dual temporal grounding and contrastive learning to improve response relevance by capturing nuanced temporal dynamics in videos and conversations.
Contribution
It proposes a dual temporal grounding approach combined with contrastive learning, enhancing video dialog response generation by better modeling temporal relationships.
Findings
Outperforms existing models on AVSD@DSTC-7 and AVSD@DSTC-8 datasets.
Effectively captures temporal dependencies in video and dialog contexts.
Improves response relevance through dual temporal grounding and contrastive learning.
Abstract
In the realm of video dialog response generation, the understanding of video content and the temporal nuances of conversation history are paramount. While a segment of current research leans heavily on large-scale pretrained visual-language models and often overlooks temporal dynamics, another delves deep into spatial-temporal relationships within videos but demands intricate object trajectory pre-extractions and sidelines dialog temporal dynamics. This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD), strategically designed to merge the strengths of both dominant approaches. It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions, filtering video content accordingly, and grounding responses in both video and dialog contexts. One standout feature of DTGVD is its heightened attention to chronological interplay. By…
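The filtering step the abstract describes (predict a temporal region for a dialog turn, then keep only the video content inside it) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `temporal_mask` and `filter_frames` are hypothetical, the region is passed in as a given rather than predicted by a model, frames are assumed to be uniformly sampled, and a hard 0/1 mask is used here even though the paper's masks may be soft or learned.

```python
from typing import List, Sequence, Tuple


def temporal_mask(num_frames: int, duration: float,
                  region: Tuple[float, float]) -> List[int]:
    """Return a 0/1 mask over uniformly sampled frames.

    A frame is kept (1) when the timestamp at the center of its
    sampling interval falls inside `region`, given in seconds.
    Hypothetical helper: the region stands in for a model prediction.
    """
    if num_frames <= 0 or duration <= 0:
        raise ValueError("num_frames and duration must be positive")
    start, end = region
    step = duration / num_frames
    mask = []
    for i in range(num_frames):
        center = (i + 0.5) * step  # center timestamp of frame i
        mask.append(1 if start <= center <= end else 0)
    return mask


def filter_frames(frames: Sequence, mask: Sequence[int]) -> list:
    """Keep only the frames whose mask entry is non-zero."""
    if len(frames) != len(mask):
        raise ValueError("frames and mask must have the same length")
    return [frame for frame, keep in zip(frames, mask) if keep]


# Example: a 10-second clip sampled at 10 frames, with an assumed
# region of 2.0 s to 5.0 s for the current dialog turn.
mask = temporal_mask(10, 10.0, (2.0, 5.0))
kept = filter_frames(list("abcdefghij"), mask)
```

In a trained model the retained frames (or a soft-weighted version of all frames) would then be passed to the response generator together with the selected dialog history.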
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The idea of temporally grounding both relevant dialog turns and relevant video frames makes sense.
- The proposed method shows superior performance on AVSD@DSTC-7/DSTC-8, outperforming previous approaches.
- The ablation studies and in-depth analysis validate the idea and model design choices.
The paper is overall good, but I have the following concerns:
- Recently, many video multimodal large language models (VideoLLMs) have been proposed, including Video-LLaVA [1], SeViLA [2], LLaMA-VID [3], Video-ChatGPT [4], and Video-LLaMA. Please compare with them, if possible.
- Are there any other video dialog datasets that are sourced from a different video distribution? Evaluating on other datasets such as VSTAR [5] would make the paper stronger.
- In section 4.4 (L423-425), the author
1. It introduces an innovative approach with the Dual Temporal Grounding-enhanced Video Dialog (DTGVD) model, which leverages dual temporal dynamics inherent in both video sequences and dialog histories.
2. This model employs a temporal grounding module to explicitly model the attention shift of each dialog turn over the video, generating temporal masks to filter out irrelevant video frames and dialog history.
1. The motivation is not convincing enough. Why is the temporal information in the dialog history needed? Existing works can also model the video temporal structure and answer the question.
2. The baseline models are too old and are not the latest SOTA. There are some new works, for example: [1] HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue; [2] M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced Video-grounded Dialogue Generation. The grounded QA methods should
1. This paper proposes a Dual Temporal Grounding-enhanced Video Dialog model to achieve SOTA performance on the video dialog generation task.
2. This paper presents comprehensive experiments to demonstrate the effectiveness of this method.
1. The writing of this paper needs further improvement. For example, the abstract contains several sentences that lack clarity, and there is a lack of coherence between the different parts of the abstract.
2. The selection of dialog history relies on the timestamps of each QA pair. To be applicable in real-world scenarios, these timestamps should not be treated as information available to the model at inference time.
3. The contrastive learning method is simply utilized to optimize the te
1. The idea of the paper is easy to understand.
2. The presentation of the method section is relatively detailed.
1. The idea of this article is similar to [1] in that both are trying to mine meaningful historical information. Can you clarify the difference between the two? And the advantages and disadvantages?
2. Too many recent baseline methods in this field are not cited or compared, e.g., ISR [1], PDC [2], RLM [3], T5RLM [4], JVIT [5], and DialogMCF [6]. Also, why are the baseline results in your paper inconsistent with the results in other papers? For example, SCGA obtains 0.745 in BLEU-1 in [1] vs 0.702 in
Taxonomy
Topics: Video Analysis and Summarization
Methods: Softmax · Attention Is All You Need · Contrastive Learning
